Categorical 
Data  Analysis 

Third  Edition 


AGRESTI 




WILEY  SERIES  IN  PROBABILITY  AND  STATISTICS 

Established  by  WALTER  A.  SHEWHART  and  SAMUEL  S.  WILKS 

Editors:  David  J.  Balding,  Noel  A.  C.  Cressie,  Garrett  M.  Fitzmaurice, 

Harvey Goldstein, Iain M. Johnstone, Geert Molenberghs, David W. Scott,

Adrian  F.  M.  Smith,  Ruey  S.  Tsay,  Sanford  Weisberg 

Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
A  complete  list  of  the  titles  in  this  series  appears  at  the  end  of  this  volume. 


Categorical  Data  Analysis 

Third  Edition 


ALAN  AGRESTI 

Department  of  Statistics 
University  of  Florida 
Gainesville,  Florida 


WILEY-INTERSCIENCE

A  JOHN  WILEY  &  SONS,  INC.,  PUBLICATION 


Cover  Image:  (background)  Peter  Firus/iStockphoto,  (line  art)  courtesy  of  the  author 


Copyright © 2013 by John Wiley & Sons. All rights reserved.

Published  by  John  Wiley  &  Sons,  Inc.,  Hoboken,  New  Jersey 
Published  simultaneously  in  Canada 

No  part  of  this  publication  may  be  reproduced,  stored  in  a  retrieval  system,  or  transmitted  in  any  form  or  by  any 
means,  electronic,  mechanical,  photocopying,  recording,  scanning,  or  otherwise,  except  as  permitted  under 
Section  107  or  108  of  the  1976  United  States  Copyright  Act,  without  either  the  prior  written  permission  of  the 
Publisher,  or  authorization  through  payment  of  the  appropriate  per-copy  fee  to  the  Copyright  Clearance  Center, 
Inc.,  222  Rosewood  Drive,  Danvers,  MA  01923,  978-750-8400,  fax  978-750-4470,  or  on  the  web  at 
www.copyright.com.  Requests  to  the  Publisher  for  permission  should  be  addressed  to  the  Permissions 
Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008,
or  online  at  http://www.wiley.com/go/permission. 

Limit  of  Liability/Disclaimer  of  Warranty:  While  the  publisher  and  author  have  used  their  best  efforts  in 
preparing  this  book,  they  make  no  representations  or  warranties  with  respect  to  the  accuracy  or  completeness  of 
the  contents  of  this  book  and  specifically  disclaim  any  implied  warranties  of  merchantability  or  fitness  for  a 
particular  purpose.  No  warranty  may  be  created  or  extended  by  sales  representatives  or  written  sales  materials. 
The  advice  and  strategies  contained  herein  may  not  be  suitable  for  your  situation.  You  should  consult  with  a 
professional  where  appropriate.  Neither  the  publisher  nor  author  shall  be  liable  for  any  loss  of  profit  or  any  other 
commercial  damages,  including  but  not  limited  to  special,  incidental,  consequential,  or  other  damages. 

For  general  information  on  our  other  products  and  services  or  for  technical  support,  please  contact  our  Customer 
Care  Department  within  the  United  States  at  800-762-2974,  outside  the  United  States  at  317-572-3993  or 
fax  317-572-4002. 

Wiley  also  publishes  its  books  in  a  variety  of  electronic  formats.  Some  content  that  appears  in  print  may  not  be 
available  in  electronic  formats.  For  more  information  about  Wiley  products,  visit  our  web  site  at  www.wiley.com. 

Library  of  Congress  Cataloging-in-Publication  Data 

Agresti,  Alan. 

Categorical  data  analysis  /  Alan  Agresti.  -  3rd  ed. 

p.  cm.  -  (Wiley  series  in  probability  and  statistics;  792) 

Includes  bibliographical  references  and  index. 

ISBN  978-0-470-46363-5  (hardback) 

1. Multivariate analysis. I. Title.

QA278.A353  2013 
519.5'35-dc23

2012009792 


Printed  in  the  United  States  of  America 


10 9 8 7 6 5 4 3 2 1


To  Jacki 


Contents 


Preface 

1  Introduction: Distributions and Inference for Categorical Data  1

1.1  Categorical Response Data, 1

1.2  Distributions for Categorical Data, 5

1.3  Statistical Inference for Categorical Data, 8

1.4  Statistical Inference for Binomial Parameters, 13

1.5  Statistical Inference for Multinomial Parameters, 17

1.6  Bayesian Inference for Binomial and Multinomial Parameters, 22
Notes,  27 

Exercises,  28 

2  Describing Contingency Tables  37

2.1  Probability Structure for Contingency Tables, 37

2.2  Comparing  Two  Proportions,  43 

2.3  Conditional  Association  in  Stratified  2x2  Tables,  47 

2.4  Measuring Association in I x J Tables, 54
Notes,  60 

Exercises,  60 

3  Inference for Two-Way Contingency Tables  69

3.1  Confidence  Intervals  for  Association  Parameters,  69 

3.2  Testing  Independence  in  Two-way  Contingency  Tables,  75 

3.3  Following-up  Chi-Squared  Tests,  80 

3.4  Two-Way  Tables  with  Ordered  Classifications,  86 

3.5  Small-Sample  Inference  for  Contingency  Tables,  90 

3.6  Bayesian  Inference  for  Two-way  Contingency  Tables,  96 

3.7  Extensions  for  Multiway  Tables  and  Nontabulated  Responses,  100 
Notes,  101 

Exercises,  103 




4  Introduction  to  Generalized  Linear  Models  113 

4.1  The Generalized Linear Model, 113

4.2  Generalized Linear Models for Binary Data, 117

4.3  Generalized Linear Models for Counts and Rates, 122

4.4  Moments and Likelihood for Generalized Linear Models, 130

4.5  Inference and Model Checking for Generalized Linear Models, 136

4.6  Fitting  Generalized  Linear  Models,  143 

4.7  Quasi-Likelihood  and  Generalized  Linear  Models,  149 
Notes,  152 

Exercises,  153 

5  Logistic  Regression  163 

5.1  Interpreting  Parameters  in  Logistic  Regression,  163 

5.2  Inference  for  Logistic  Regression,  169 

5.3  Logistic  Models  with  Categorical  Predictors,  175 

5.4  Multiple  Logistic  Regression,  182 

5.5  Fitting  Logistic  Regression  Models,  192 
Notes,  195 

Exercises,  196 

6  Building,  Checking,  and  Applying  Logistic  Regression  Models  207 

6.1  Strategies in Model Selection, 207

6.2  Logistic  Regression  Diagnostics,  215 

6.3  Summarizing  the  Predictive  Power  of  a  Model,  221 

6.4  Mantel-Haenszel  and  Related  Methods  for  Multiple  2x2  Tables,  225 

6.5  Detecting  and  Dealing  with  Infinite  Estimates,  233 

6.6  Sample  Size  and  Power  Considerations,  237 
Notes,  241 

Exercises,  243 

7  Alternative  Modeling  of  Binary  Response  Data  251 

7.1  Probit and Complementary Log-log Models, 251

7.2  Bayesian  Inference  for  Binary  Regression,  257 

7.3  Conditional  Logistic  Regression,  265 

7.4  Smoothing:  Kernels,  Penalized  Likelihood,  Generalized 
Additive  Models,  270 

7.5  Issues  in  Analyzing  High-Dimensional  Categorical  Data,  278 
Notes,  285 

Exercises,  287 




8  Models  for  Multinomial  Responses  293 

8.1  Nominal  Responses:  Baseline-Category  Logit  Models,  293 

8.2  Ordinal  Responses:  Cumulative  Logit  Models,  301 

8.3  Ordinal  Responses:  Alternative  Models,  308 

8.4  Testing  Conditional  Independence  in  I  x  J  x  K  Tables,  314 

8.5  Discrete-Choice  Models,  320 

8.6  Bayesian  Modeling  of  Multinomial  Responses,  323 
Notes,  326 
Exercises,  329 

9  Loglinear Models for Contingency Tables  339

9.1  Loglinear  Models  for  Two-way  Tables,  339 

9.2  Loglinear  Models  for  Independence  and  Interaction  in  Three-way 
Tables,  342 

9.3  Inference  for  Loglinear  Models,  348 

9.4  Loglinear  Models  for  Higher  Dimensions,  350 

9.5  Loglinear-Logistic Model Connection, 353

9.6  Loglinear  Model  Fitting:  Likelihood  Equations  and  Asymptotic 
Distributions,  356 

9.7  Loglinear Model Fitting: Iterative Methods and Their Application, 364

Notes,  368 
Exercises,  369 

10  Building  and  Extending  Loglinear  Models  377 

10.1  Conditional  Independence  Graphs  and  Collapsibility,  377 

10.2  Model  Selection  and  Comparison,  380 

10.3  Residuals  for  Detecting  Cell-Specific  Lack  of  Fit,  385 

10.4  Modeling  Ordinal  Associations,  386 

10.5  Generalized  Loglinear  and  Association  Models,  Correlation  Models, 
and  Correspondence  Analysis,  393 

10.6  Empty  Cells  and  Sparseness  in  Modeling  Contingency  Tables,  398 

10.7  Bayesian  Loglinear  Modeling,  401 
Notes,  404 

Exercises,  407 

11  Models  for  Matched  Pairs  413 

11.1  Comparing  Dependent  Proportions,  414 

11.2  Conditional  Logistic  Regression  for  Binary  Matched  Pairs,  418 

11.3  Marginal Models for Square Contingency Tables, 424

11.4  Symmetry,  Quasi-Symmetry,  and  Quasi-Independence,  426 




11.5  Measuring Agreement Between Observers, 432

11.6  Bradley-Terry Model for Paired Preferences, 436

11.7  Marginal Models and Quasi-Symmetry Models for Matched Sets, 439
Notes,  443 

Exercises,  445 

12  Clustered  Categorical  Data:  Marginal  and  Transitional  Models  455 

12.1  Marginal  Modeling:  Maximum  Likelihood  Approach,  456 

12.2  Marginal Modeling: Generalized Estimating Equations (GEEs)
Approach, 462

12.3  Quasi-Likelihood  and  Its  GEE  Multivariate  Extension:  Details,  465 

12.4  Transitional  Models:  Markov  Chain  and  Time  Series  Models,  473 
Notes,  478 

Exercises,  479 

13  Clustered  Categorical  Data:  Random  Effects  Models  489 

13.1  Random  Effects  Modeling  of  Clustered  Categorical  Data,  489 

13.2  Binary  Responses:  Logistic-Normal  Model,  494 

13.3  Examples  of  Random  Effects  Models  for  Binary  Data,  498 

13.4  Random  Effects  Models  for  Multinomial  Data,  511 

13.5  Multilevel  Modeling,  515 

13.6  GLMM  Fitting,  Inference,  and  Prediction,  519 

13.7  Bayesian  Multivariate  Categorical  Modeling,  523 
Notes,  525 

Exercises,  527 

14  Other  Mixture  Models  for  Discrete  Data  535 

14.1  Latent  Class  Models,  535 

14.2  Nonparametric  Random  Effects  Models,  542 

14.3  Beta-Binomial  Models,  548 

14.4  Negative  Binomial  Regression,  552 

14.5  Poisson Regression with Random Effects, 555
Notes,  557 

Exercises,  558 

15  Non-Model-Based  Classification  and  Clustering  565 

15.1  Classification:  Linear  Discriminant  Analysis,  565 

15.2  Classification:  Tree-Structured  Prediction,  570 

15.3  Cluster  Analysis  for  Categorical  Data,  576 
Notes,  581 

Exercises,  582 




16  Large-  and  Small-Sample  Theory  for  Multinomial  Models  587 

16.1  Delta  Method,  587 

16.2  Asymptotic  Distributions  of  Estimators  of  Model  Parameters  and  Cell 
Probabilities,  592 

16.3  Asymptotic  Distributions  of  Residuals  and  Goodness-of-fit  Statistics,  594 

16.4  Asymptotic  Distributions  for  Logit/Loglinear  Models,  599 

16.5  Small-Sample  Significance  Tests  for  Contingency  Tables,  601 

16.6  Small-Sample  Confidence  Intervals  for  Categorical  Data,  603 

16.7  Alternative  Estimation  Theory  for  Parametric  Models,  610 
Notes,  615 

Exercises,  616 

17  Historical  Tour  of  Categorical  Data  Analysis  623 

17.1  Pearson-Yule Association Controversy, 623

17.2  R.  A.  Fisher’s  Contributions,  625 

17.3  Logistic  Regression,  627 

17.4  Multiway  Contingency  Tables  and  Loglinear  Models,  629 

17.5  Bayesian  Methods  for  Categorical  Data,  633 

17.6  A  Look  Forward,  and  Backward,  634 


Appendix A  Statistical Software for Categorical Data Analysis  637

Appendix B  Chi-Squared Distribution Values  641

References  643

Author Index  689

Example Index  701

Subject Index  705

Appendix C  Software Details for Text Examples (text website)
(www.stat.ufl.edu/~aa/cda/cda.html)

Appendix D  Solutions to Selected Exercises (text website)
(www.stat.ufl.edu/~aa/cda/cda.html)


Preface 


The  explosion  in  the  development  of  methods  for  analyzing  categorical  data  that  began  in 
the  1960s  has  continued  apace  in  recent  years.  This  book  provides  an  overview  of  these 
methods,  as  well  as  older,  now  standard,  methods.  It  gives  special  emphasis  to  generalized 
linear  modeling  techniques,  which  extend  linear  model  methods  for  continuous  variables, 
and  their  extensions  for  multivariate  responses. 


OUTLINE  OF  TOPICS 

Chapters 1-10 present the core methods for categorical response variables. Chapters 1-3
cover distributions for categorical responses and traditional methods for two-way contingency
tables. Chapters 4-8 introduce logistic regression and related models such as the
probit model for binary and multicategory response variables. Chapters 9 and 10 cover
loglinear models for contingency tables.

In the past quarter century, a major area of new research has been the development of
methods for repeated measurement and other forms of clustered categorical data. Chapters
11-14 present these methods, including marginal models and generalized linear mixed
models with random effects. Chapter 15 introduces non-model-based methods for classification
and clustering. Chapter 16 presents theoretical foundations as well as alternatives
to the maximum likelihood paradigm that this text adopts. Chapter 17 is devoted to a
historical overview of the development of the methods. It examines contributions of noted
statisticians, such as Pearson and Fisher, whose pioneering efforts (and sometimes vocal
debates) broke the ground for this evolution.

Appendices  illustrate  the  use  of  statistical  software  for  analyzing  categorical  data.  The 
website  for  the  text,  www.stat.ufl.edu/~aa/cda/cda.html,  contains  an  appendix 
with detailed examples of the use of software (especially R, SAS, and Stata) for performing
the  analyses  in  this  book,  solutions  to  many  of  the  exercises,  extra  exercises,  and  corrections. 


CHANGES  IN  THIS  EDITION 

Given the explosion of research in the past 50 years on categorical data methods, it is an
increasing challenge to write a comprehensive book covering all the commonly used methods.
The second edition of this book already exceeded 700 pages. In including much new
material without letting the book grow much, I have necessarily had to make compromises
in depth and use relatively simple examples. I try to present a broad overview, while presenting
bibliographic notes with many references in which the reader can find more details.
In attempting to make the book relatively comprehensive while presenting substantive new
material, every chapter of the first two editions has been extensively rewritten. The major
changes are:

•  A  new  Chapter  7  presents  alternative  methods  for  binary  response  data,  including 
some  regularization  methods  that  are  becoming  popular  in  this  age  of  massive  data 
sets  with  enormous  numbers  of  variables. 

•  A new Chapter 15 introduces non-model-based methods of classification, such as
linear  discriminant  analysis  and  classification  trees,  and  cluster  analysis. 

•  Many chapters now include a section describing the Bayesian approach for the methods
of that chapter. We also have added material (e.g., Sections 6.5 and 7.4) about ways
that frequentist methods can deal with awkward situations such as infinite maximum
likelihood estimates.

•  The use of various software for categorical data methods is discussed at a much expanded
website for the text, www.stat.ufl.edu/~aa/cda/cda.html. Examples
are shown of the use of R, SAS, and Stata for most of the examples in the text, and
there is discussion also about SPSS, StatXact, and other software. That website also
contains many of the text's data sets, some of which have only excerpts shown in the
text itself, as well as solutions for many exercises and corrections of errors found in
early printings of the book. I recommend that you refer to this appendix (or specialized
software manuals) while reading the text, perhaps printing the pages about the
software you prefer, as an aid to implementing the methods. This material was placed
at the website partly because the text is already so long without it and also because it
is then easier to keep the presentation up-to-date.

In this text, I interpret categorical data analysis to refer to methods for categorical
response variables. For most methods, explanatory variables can be categorical or quantitative,
as in ordinary regression. Thus, the focus is intended to be more general than
contingency table analysis, although for simplicity of data presentation, most examples use
contingency tables. These examples are simplistic, but should help you focus on understanding
the methods themselves and make it easier for you to replicate results with your
favorite software.

Other  special  features  of  the  text  include: 

•  More  than  100  analyses  of  data  sets. 

•  About  600  exercises,  some  directed  toward  theory  and  methods  and  some  toward 
applications  and  data  analysis. 

•  Notes  at  the  end  of  each  chapter  that  provide  references  for  recent  research  and  many 
topics  not  covered  in  the  text,  linked  to  a  bibliography  of  more  than  1200  sources. 


INTENDED  AUDIENCE  AND  USE  AS  A  TEXTBOOK 

I intend this book to be accessible to the diverse mix of students who take graduate-level
courses in categorical data analysis. But I have also written it with practicing statisticians
and biostatisticians in mind. I hope it enables them to catch up with recent advances and
learn about methods that sometimes receive inadequate attention in the traditional statistics
curriculum.

The development of new methods has influenced, and been influenced by, the increasing
availability of data sets with categorical responses in the social, behavioral, and
biomedical sciences, as well as in public health, genetics, ecology, education, marketing and
the financial industry, and industrial quality control. And so, although this book is directed
mainly to statisticians and biostatisticians, I also aim for it to be helpful to methodologists
in these fields.

Readers  should  possess  a  background  that  includes  regression  and  analysis  of  variance 
models,  as  well  as  maximum  likelihood  methods  of  statistical  theory.  Those  not  having 
much  theory  background  should  be  able  to  follow  most  methodological  discussions.  Those 
with  mainly  applied  interests  can  skip  most  of  Chapter  4  on  the  theory  of  generalized  linear 
models  and  proceed  to  other  chapters.  However,  the  book  has  a  distinctly  higher  technical 
level  and  is  more  thorough  and  complete  than  my  lower-level  text,  An  Introduction  to 
Categorical  Data  Analysis,  Second  Edition  (Wiley,  2007). 

Today,  because  of  the  ubiquity  of  categorical  data  in  applications,  most  statistics  and 
biostatistics  departments  offer  courses  on  categorical  data  analysis  or  on  generalized  linear 
models  with  strong  emphasis  on  methods  for  discrete  data.  This  book  can  be  used  as  a  text 
for  such  courses.  The  material  in  Chapters  1-6  forms  the  heart  of  most  courses.  There  is 
too  much  material  in  this  book  for  a  single  course,  but  a  one-term  course  can  be  based  on 
the  following  outline: 

•  Basic  contingency  table  analysis,  covering  Chapters  1-3,  perhaps  skipping  some 
tangential  sections  such  as  1.5.7,  1.6,  2.4,  3.4-3.7. 

•  Logistic  regression  and  related  methods  for  binary  data,  covering  Chapters  4-6, 
perhaps  skipping  some  tangential  sections  such  as  4.4-4.7  and  6.4-6.6. 

•  Multinomial  response  models,  covering  at  least  Sections  8.1  and  8.2. 

•  Matched pairs and clustered data, covering at least Sections 11.1-11.2.

Courses  with  biostatistical  orientation  may  want  to  include  bits  from  Chapters  12  and  13 
on  marginal  and  random  effects  models.  Courses  with  social  science  emphasis  may  want 
to  include  some  topics  on  loglinear  modeling  from  Chapters  9  and  10.  Some  courses  may 
want  to  select  specialized  topics  from  Chapter  7,  such  as  probit  modeling,  conditional 
logistic  regression,  Bayesian  binary  data  modeling,  smoothing,  and  issues  in  the  analysis 
of high-dimensional data.


CHAPTER 1


Introduction:  Distributions  and 
Inference  for  Categorical  Data 


From  helping  to  assess  the  value  of  new  medical  treatments  to  evaluating  the  factors  that 
affect  our  opinions  and  behaviors,  analysts  today  are  finding  myriad  uses  for  categorical 
data  methods.  In  this  book  we  introduce  these  methods  and  the  theory  behind  them. 

Statistical methods for categorical responses were late in gaining the level of sophistication
achieved early in the twentieth century by methods for continuous responses. Despite
influential work around 1900 by the British statistician Karl Pearson, relatively little
development of models for categorical responses occurred until the 1960s. In this book we
describe the early fundamental work that still has importance today but place primary
emphasis on more recent modeling approaches.


1.1  CATEGORICAL  RESPONSE  DATA 

A categorical variable has a measurement scale consisting of a set of categories. For
instance, political philosophy is often measured as liberal, moderate, or conservative.
Diagnoses regarding breast cancer based on a mammogram use the categories normal, benign,
probably benign, suspicious, and malignant.

The  development  of  methods  for  categorical  variables  was  stimulated  by  the  need  to 
analyze  data  generated  in  research  studies  in  both  the  social  and  biomedical  sciences. 
Categorical  scales  are  pervasive  in  the  social  sciences  for  measuring  attitudes  and  opinions. 
Categorical  scales  in  biomedical  sciences  measure  outcomes  such  as  whether  a  medical 
treatment  is  successful. 

Categorical data are by no means restricted to the social and biomedical sciences. They
frequently occur in the behavioral sciences (e.g., type of mental illness, with the categories
schizophrenia, depression, neurosis), epidemiology and public health (e.g., contraceptive
method at last sexual intercourse, with the categories none, condom, pill, IUD, other),
genetics (type of allele inherited by an offspring), botany and zoology (e.g., whether or
not a particular organism is observed in a sampled quadrat), education (e.g., whether a
student response to an exam question is correct or incorrect), and marketing (e.g., consumer
preference among the three leading brands of a product). They even occur in highly
quantitative fields such as engineering sciences and industrial quality control. Examples are the
classification of items according to whether they conform to certain standards, and subjective
evaluation of some characteristic: how soft to the touch a certain fabric is, how good a
particular food product tastes, or how easy a worker finds it to perform a certain task.

Categorical  variables  are  of  many  types.  In  this  section  we  provide  ways  of  classifying 
them. 

1.1.1  Response-Explanatory  Variable  Distinction 

Statistical analyses distinguish between response (or dependent) variables and explanatory
(or independent) variables. This book focuses on methods for categorical response
variables. As in ordinary regression modeling, explanatory variables can be any type. For
instance, a study might analyze how opinion about whether same-sex marriages should be
legal (yes or no) changes according to values of explanatory variables, such as religious
affiliation, political ideology, number of years of education, annual income, age, gender,
and race.

1.1.2  Binary-Nominal-Ordinal  Scale  Distinction 

Many  categorical  variables  have  only  two  categories.  Such  variables,  for  which  the  two 
categories  are  often  given  the  generic  labels  “success”  and  “failure,”  are  called  binary 
variables.  A  major  topic  of  this  book  is  the  modeling  of  binary  response  variables. 

When  a  categorical  variable  has  more  than  two  categories,  we  distinguish  between 
two  types  of  categorical  scales.  Variables  having  categories  without  a  natural  ordering  are 
said  to  be  measured  on  a  nominal  scale  and  are  called  nominal  variables.  Examples  are 
mode  of  transportation  to  get  to  work  (automobile,  bicycle,  bus,  subway,  walk),  favorite 
type  of  music  (classical,  country,  folk,  jazz,  rock),  and  choice  of  residence  (apartment, 
condominium,  house,  other).  For  nominal  variables,  the  order  of  listing  the  categories  is 
irrelevant  to  the  statistical  analysis. 

Many  categorical  variables  do  have  ordered  categories.  Such  variables  are  said  to  be 
measured  on  an  ordinal  scale  and  are  called  ordinal  variables.  Examples  are  social  class 
(upper,  middle,  lower),  political  philosophy  (very  liberal,  slightly  liberal,  moderate,  slightly 
conservative,  very  conservative),  patient  condition  (good,  fair,  serious,  critical),  and  rating 
of  a  movie  for  Netflix  (1  to  5  stars,  representing  hated  it,  didn’t  like  it,  liked  it,  really  liked 
it,  loved  it).  For  ordinal  variables,  distances  between  categories  are  unknown.  Although 
a  person  categorized  as  very  liberal  is  more  liberal  than  a  person  categorized  as  slightly 
liberal,  no  numerical  value  describes  how  much  more  liberal  that  person  is. 

An  interval  variable  is  one  that  does  have  numerical  distances  between  any  two  values. 
For  example,  systolic  blood  pressure  level,  length  of  prison  term,  and  annual  income  are 
interval  variables.  For  most  such  variables,  it  is  also  possible  to  compare  two  values  by 
their  ratio,  in  which  case  the  variable  is  also  called  a  ratio  variable. 

The way that a variable is measured determines its classification. For example, "education"
is only nominal when measured as (public school, private school, home schooling);
it is ordinal when measured by highest degree attained, using the categories (none, high
school, bachelor's, master's, doctorate); it is interval when measured by number of years
of education completed, using the integers 0, 1, 2, 3, ....




A  variable’s  measurement  scale  determines  which  statistical  methods  are  appropriate. 
It  is  usually  best  to  apply  methods  appropriate  for  the  actual  scale.  In  the  measurement 
hierarchy,  interval  variables  are  highest,  ordinal  variables  are  next,  and  nominal  variables 
are  lowest.  Statistical  methods  for  variables  of  one  type  can  also  be  used  with  variables  at 
higher  levels  but  not  at  lower  levels.  For  instance,  statistical  methods  for  nominal  variables 
can  be  used  with  ordinal  variables  by  ignoring  the  ordering  of  categories.  Methods  for 
ordinal  variables  cannot,  however,  be  used  with  nominal  variables,  since  their  categories 
have  no  meaningful  ordering.  The  distinction  between  ordered  and  unordered  categories 
is  not  important  for  binary  variables,  because  ordinal  methods  and  nominal  methods  then 
typically  reduce  to  equivalent  methods. 
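The hierarchy rule in the preceding paragraph can be written down as a small check. The following is only an illustrative sketch: the names `Scale` and `method_applies` are invented here and do not come from the text or from any statistical package.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales, ordered from lowest to highest in the hierarchy."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3


def method_applies(method_scale: Scale, variable_scale: Scale) -> bool:
    """A method designed for one scale can be used with variables at that
    level or higher, but not with variables at a lower level."""
    return variable_scale >= method_scale


# Nominal methods can be used with ordinal variables (ignoring the ordering).
print(method_applies(Scale.NOMINAL, Scale.ORDINAL))
# Ordinal methods cannot be used with nominal variables.
print(method_applies(Scale.ORDINAL, Scale.NOMINAL))
```

The first call prints `True` and the second prints `False`, mirroring the two cases described above.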

In this book, we present methods for the analysis of binary, nominal, and ordinal
variables. The methods also apply to interval variables having a small number of distinct
values (e.g., number of times married, number of distinct side effects experienced in taking
some drug) or for which the values are grouped into ordered categories (e.g., education
measured as < 12 years, > 12 but < 16 years, > 16 years).
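Grouping an interval variable into ordered categories can be sketched as follows. The cut points, the category labels, the handling of the boundary values (exactly 12 or 16 years), and the sample values are all assumptions made for this illustration; they are not taken from the text.

```python
from bisect import bisect_right
from collections import Counter

# Assumed convention: values below 12 fall in the first group, 12 up to
# (but not including) 16 in the second, and 16 or more in the third.
CUT_POINTS = [12, 16]
LABELS = ["under 12 years", "12 to 15 years", "16 or more years"]


def education_group(years: int) -> str:
    """Map years of education (interval scale) to an ordered category."""
    return LABELS[bisect_right(CUT_POINTS, years)]


# Made-up values, for illustration only.
sample_years = [8, 12, 12, 14, 16, 18, 11]
counts = Counter(education_group(y) for y in sample_years)

for label in LABELS:  # report the categories in their natural order
    print(label, counts[label])
```

The resulting counts form the kind of ordered categorical variable that the ordinal methods in this book take as input.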


1.1.3  Discrete-Continuous  Variable  Distinction 

Variables  are  classified  as  discrete  or  continuous,  according  to  whether  the  number  of 
values  they  can  take  is  countable.  Actual  measurement  of  all  variables  occurs  in  a  discrete 
manner,  due  to  precision  limitations  in  measuring  instruments.  The  discrete-continuous 
classification,  in  practice,  distinguishes  between  variables  that  take  few  values  and  variables 
that  take  lots  of  values.  For  instance,  statisticians  often  treat  discrete  interval  variables  having 
a  large  number  of  values  (such  as  test  scores)  as  continuous,  using  them  in  methods  for 
continuous  responses. 

This book deals with certain types of discretely measured responses: (1) binary variables,
(2) nominal variables, (3) ordinal variables, (4) discrete interval variables having
relatively few values, and (5) continuous variables grouped into a small number of
categories.


1.1.4  Quantitative-Qualitative  Variable  Distinction 

Nominal variables are qualitative: distinct categories differ in quality, not in quantity.
Interval variables are quantitative: distinct levels have differing amounts of the characteristic
of interest. The position of ordinal variables in the qualitative-quantitative classification
is fuzzy. Analysts often treat them as qualitative, using methods for nominal variables.
But in many respects, ordinal variables more closely resemble interval variables than they
resemble nominal variables. They possess important quantitative features: Each category
has a greater or smaller magnitude of the characteristic than another category; and although
not possible to measure, an underlying continuous variable is often present. The political
ideology classification (very liberal, slightly liberal, moderate, slightly conservative, very
conservative) crudely measures an inherently continuous characteristic.

Analysts  often  utilize  the  quantitative  nature  of  ordinal  variables  by  assigning  numerical 
scores  to  the  categories  or  assuming  an  underlying  continuous  distribution.  This  requires 
good  judgment  and  guidance  from  researchers  who  use  the  scale,  but  it  provides  benefits 
in  the  variety  of  methods  available  for  data  analysis. 
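As a minimal sketch of assigning scores, the political ideology categories can be given equally spaced scores and summarized by a mean. The equal spacing is an assumption (other score choices are possible and may change conclusions), and the counts below are invented purely for illustration.

```python
# Equally spaced scores: an assumption, not the only defensible choice.
SCORES = {
    "very liberal": 1,
    "slightly liberal": 2,
    "moderate": 3,
    "slightly conservative": 4,
    "very conservative": 5,
}

# Made-up counts for illustration only (not data from the text).
counts = {
    "very liberal": 10,
    "slightly liberal": 30,
    "moderate": 40,
    "slightly conservative": 15,
    "very conservative": 5,
}


def mean_score(counts: dict, scores: dict) -> float:
    """Mean of the category scores, weighted by the observed counts."""
    total = sum(counts.values())
    return sum(scores[category] * n for category, n in counts.items()) / total


print(mean_score(counts, SCORES))
```

With these hypothetical counts the mean score is 2.75, slightly on the liberal side of the midpoint score of 3. Such a summary is only as meaningful as the scores chosen, which is why the choice calls for judgment.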
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1.1.5  Organization  of  Book  and  Online  Computing  Appendix 

The models for categorical response variables discussed in this book resemble regression models for continuous response variables; however, they assume binomial or multinomial response distributions instead of normality. One type of model receives special attention: logistic regression. Ordinary logistic regression models apply with binary responses and assume a binomial distribution. Generalizations of logistic regression apply with multicategory responses and assume a multinomial distribution.

The book has four main units. In the first, Chapters 1 through 3, we summarize descriptive
and  inferential  methods  for  univariate  and  bivariate  categorical  data.  These  chapters  cover 
discrete  distributions,  methods  of  inference,  and  measures  of  association  for  contingency 
tables.  They  summarize  the  non-model-based  methods  developed  prior  to  about  1960. 

In the second and primary unit, Chapters 4 through 10, we introduce models for categorical responses. In Chapter 4 we describe a class of generalized linear models having models of this text as special cases. Chapters 5 and 6 cover the most important model for binary responses, logistic regression. Chapter 7 presents alternative methods for binary data, including the probit, Bayesian fitting, and smoothing methods. In Chapter 8 we present generalizations of the logistic regression model for nominal and ordinal multicategory response variables. In Chapters 9 and 10 we introduce the modeling of multivariate categorical response data, in terms of association and interaction patterns among the variables. The models, called loglinear models, apply to counts in the table that cross-classifies those responses.

In the third unit, Chapters 11 through 14, we discuss models for handling repeated measurement and other forms of clustered data. In Chapter 11 we present models for a categorical response with matched pairs; these apply, for instance, with a categorical response measured for the same subjects at two times. Chapter 12 covers models for more general types of repeated categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 13 we present a broad class of models, generalized linear mixed models, that use random effects to account for dependence with such data. In Chapter 14 further extensions of the models from Chapters 11 through 13 are described, unified by treating the response as having a mixture distribution of some type.

The  fourth  and  final  unit  has  a  different  nature  than  the  others.  In  Chapter  15  we  consider 
non-model-based  classification  and  clustering  methods.  In  Chapter  16  we  summarize  large- 
sample  and  small-sample  theory  for  categorical  data  models.  This  theory  is  the  basis  for 
behavior  of  model  parameter  estimators  and  goodness-of-fit  statistics.  Chapter  17  presents 
a  historical  overview  of  the  development  of  categorical  data  methods. 

Maximum  likelihood  methods  receive  primary  attention  throughout  the  book.  Many 
chapters,  however,  contain  a  section  presenting  corresponding  Bayesian  methods. 

In  Appendix  A  we  review  software  that  can  perform  the  analyses  in  this  book.  The 
website  www.stat.ufl.edu/~aa/cda/cda.html  for  this  book  contains  an  appendix 
that  gives  more  information  about  using  R,  SAS,  Stata,  and  other  software,  with  sample 
programs  for  text  examples.  In  addition,  that  site  has  complete  data  sets  for  many  text 
examples  and  exercises,  solutions  to  some  exercises,  extra  exercises,  corrections,  and  links 
to  other  useful  sites.  For  instance,  a  manual  prepared  by  Dr.  Laura  Thompson  provides 
examples  of  how  to  use  R  and  S-Plus  for  all  examples  in  the  second  edition  of  this  text, 
many  of  which  (or  very  similar  ones)  are  also  in  this  edition. 

In the rest of this chapter, we provide background material. In Section 1.2 we review the key distributions for categorical data: the binomial and multinomial, as well as another that is important for discrete data, the Poisson. In Section 1.3 we review the primary mechanisms for statistical inference using maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting significance tests and confidence intervals for binomial and multinomial parameters. In Section 1.6 we introduce Bayesian inference for these parameters.


1.2  DISTRIBUTIONS  FOR  CATEGORICAL  DATA 

Inferential  data  analyses  require  assumptions  about  the  random  mechanism  that  generated 
the  data.  For  regression  models  with  continuous  responses,  the  normal  distribution  plays  the 
central  role.  In  this  section  we  review  the  three  key  distributions  for  categorical  responses: 
binomial,  multinomial,  and  Poisson. 


1.2.1  Binomial  Distribution 

Many applications refer to a fixed number $n$ of binary observations. Let $y_1, y_2, \ldots, y_n$ denote observations from $n$ independent and identical trials such that $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$. We refer to outcome 1 as "success" and outcome 0 as "failure." Identical trials means that the probability of success $\pi$ is the same for each trial. Independent trials means that the $\{Y_i\}$ are independent random variables. These are often called Bernoulli trials. The total number of successes, $Y = \sum_{i=1}^{n} Y_i$, has the binomial distribution with index $n$ and parameter $\pi$, denoted by bin($n$, $\pi$).

The probability mass function for the possible outcomes $y$ for $Y$ is

$$p(y) = \binom{n}{y}\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n, \tag{1.1}$$

where the binomial coefficient $\binom{n}{y} = n!/[y!(n-y)!]$. Since $E(Y_i) = E(Y_i^2) = 1 \times \pi + 0 \times (1-\pi) = \pi$,

$$E(Y_i) = \pi \quad \text{and} \quad \mathrm{var}(Y_i) = \pi(1-\pi).$$

The binomial distribution for $Y = \sum_i Y_i$ has mean and variance

$$\mu = E(Y) = n\pi \quad \text{and} \quad \sigma^2 = \mathrm{var}(Y) = n\pi(1-\pi).$$

The skewness is described by $E(Y-\mu)^3/\sigma^3 = (1-2\pi)/\sqrt{n\pi(1-\pi)}$. The distribution is symmetric when $\pi = 0.50$ but becomes increasingly skewed as $\pi$ moves toward either boundary. The binomial distribution converges to normality as $n$ increases, for fixed $\pi$, the approximation being reasonable¹ when $n[\min(\pi, 1-\pi)]$ is as small as about 5.
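The moment formulas above can be checked numerically. The following is a minimal sketch, assuming illustrative values n = 10 and π = 0.2 that do not come from the text; the function names `binom_pmf` and `binom_moments` are hypothetical helpers, not part of any library.

```python
from math import comb, sqrt


def binom_pmf(y, n, pi):
    """P(Y = y) for Y ~ bin(n, pi), following equation (1.1)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


def binom_moments(n, pi):
    """Mean, variance, and skewness computed directly from the pmf."""
    probs = [binom_pmf(y, n, pi) for y in range(n + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    third = sum((y - mean) ** 3 * p for y, p in enumerate(probs))
    return mean, var, third / var**1.5


n, pi = 10, 0.2  # illustrative values, not from the text
mean, var, skew = binom_moments(n, pi)

# Each line prints the value from the pmf next to the closed-form expression.
print(mean, n * pi)
print(var, n * pi * (1 - pi))
print(skew, (1 - 2 * pi) / sqrt(n * pi * (1 - pi)))
```

Calling `binom_moments(n, 0.5)` instead should give a skewness of essentially zero, in line with the symmetry at π = 0.50.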

There is no guarantee that successive binary observations are independent or identical. Thus, occasionally, we will utilize other distributions. One such case is sampling binary outcomes without replacement from a finite population, such as observations on whether a homework assignment was completed for 10 students sampled from a class of size 20. The hypergeometric distribution, studied in Section 3.5.1, is then relevant. In Section 1.2.4 we discuss another case that violates the binomial assumptions.

¹See www.stat.tamu.edu/~west/applets/binomialdemo2.html.


1.2.2  Multinomial  Distribution 

Some trials have more than two possible outcomes. Suppose that each of $n$ independent, identical trials can have outcome in any of $c$ categories. Let $y_{ij} = 1$ if trial $i$ has outcome in category $j$ and $y_{ij} = 0$ otherwise. Then $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$ represents a multinomial trial, with $\sum_j y_{ij} = 1$; for instance, (0, 0, 1, 0) denotes outcome in category 3 of four possible categories. Note that $y_{ic}$ is redundant, being linearly dependent on the others. Let $n_j = \sum_i y_{ij}$ denote the number of trials having outcome in category $j$. The counts $(n_1, n_2, \ldots, n_c)$ have the multinomial distribution.

Let $\pi_j = P(Y_{ij} = 1)$ denote the probability of outcome in category $j$ for each trial. The multinomial probability mass function is

$$p(n_1, n_2, \ldots, n_{c-1}) = \left(\frac{n!}{n_1!\, n_2! \cdots n_c!}\right)\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_c^{n_c}. \tag{1.2}$$

Since $\sum_j n_j = n$, this is $(c-1)$-dimensional, with $n_c = n - (n_1 + \cdots + n_{c-1})$. The binomial distribution is the special case with $c = 2$.

For  the  multinomial  distribution, 


$$E(n_j) = n\pi_j, \qquad \mathrm{var}(n_j) = n\pi_j(1-\pi_j), \qquad \mathrm{cov}(n_j, n_k) = -n\pi_j\pi_k. \tag{1.3}$$

We derive the covariance in Section 16.1.4. The marginal distribution of each $n_j$ is binomial.
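The moments in (1.3) can be verified by enumerating a small multinomial distribution. This is a sketch under assumed values (n = 4 trials, c = 3 categories, probabilities 0.5, 0.3, 0.2), none of which come from the text; `multinom_pmf` is a hypothetical helper implementing (1.2).

```python
from math import factorial


def multinom_pmf(counts, probs):
    """Multinomial probability of the count vector, following equation (1.2)."""
    coef = factorial(sum(counts))
    for nj in counts:
        coef //= factorial(nj)  # each partial quotient is an exact integer
    p = float(coef)
    for nj, pj in zip(counts, probs):
        p *= pj**nj
    return p


n, probs = 4, (0.5, 0.3, 0.2)  # illustrative values, not from the text

# All count vectors (n1, n2, n3) with n1 + n2 + n3 = n.
outcomes = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]
pmf = {cnt: multinom_pmf(cnt, probs) for cnt in outcomes}

e1 = sum(c[0] * p for c, p in pmf.items())
e2 = sum(c[1] * p for c, p in pmf.items())
var1 = sum((c[0] - e1) ** 2 * p for c, p in pmf.items())
cov12 = sum((c[0] - e1) * (c[1] - e2) * p for c, p in pmf.items())

print(e1, n * probs[0])                      # E(n_1) versus n*pi_1
print(var1, n * probs[0] * (1 - probs[0]))   # var(n_1) versus n*pi_1*(1 - pi_1)
print(cov12, -n * probs[0] * probs[1])       # cov(n_1, n_2) versus -n*pi_1*pi_2
```

The negative covariance reflects the fixed total: a trial that lands in one category cannot land in another.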


1.2.3  Poisson  Distribution 

Sometimes, count data do not result from a fixed number of trials. For instance, if $Y$ = number of automobile accidents today on motorways in Italy, there is no fixed upper bound $n$ for $Y$ (as you are aware if you have driven in Italy!). Since $Y$ must take a nonnegative integer value, its distribution should place its mass on that range. The simplest such distribution is the Poisson. Its probabilities depend on a single parameter, the mean $\mu$. The Poisson probability mass function (Poisson 1837, p. 206) is

$$p(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots. \tag{1.4}$$

It satisfies $E(Y) = \mathrm{var}(Y) = \mu$. It is unimodal with mode equal to the integer part of $\mu$. Its skewness is described by $E(Y-\mu)^3/\sigma^3 = 1/\sqrt{\mu}$. The Poisson distribution approaches normality as $\mu$ increases, the normal approximation being quite good when $\mu$ is at least about 10.

The Poisson distribution is used for counts of events that occur randomly over time or space, when outcomes in disjoint periods or regions are independent. It also applies as an approximation for the binomial when $n$ is large and $\pi$ is small, with $\mu = n\pi$. For example, suppose $Y$ = number of deaths today in auto accidents in Italy (rather than the number of accidents). Then, $Y$ has an upper bound. If each of the 50 million people driving in Italy is an independent trial with probability 0.0000003 of dying today in an auto accident, the number of deaths $Y$ is a bin(50000000, 0.0000003) variate. This is approximately Poisson with $\mu = n\pi = 50000000(0.0000003) = 15$.
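To see how close the approximation is, the two probability mass functions can be compared directly for the values quoted in the example. This is a minimal sketch: the helper names are hypothetical, the comparison range 0 to 40 is an arbitrary choice, and probabilities are computed on the log scale to avoid overflow.

```python
from math import comb, exp, lgamma, log, log1p


def binom_pmf(y, n, pi):
    """bin(n, pi) probability of y, computed on the log scale."""
    return exp(log(comb(n, y)) + y * log(pi) + (n - y) * log1p(-pi))


def poisson_pmf(y, mu):
    """Poisson probability of y, following equation (1.4)."""
    return exp(-mu + y * log(mu) - lgamma(y + 1))


n, pi = 50_000_000, 0.0000003  # values from the example in the text
mu = n * pi                    # approximately 15

# Largest absolute difference between the two pmfs over y = 0, ..., 40.
max_diff = max(abs(binom_pmf(y, n, pi) - poisson_pmf(y, mu)) for y in range(41))
print(mu, max_diff)
```

The largest difference should be far smaller than any individual probability near the mean, which is why the two models are practically interchangeable here.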

A  key  feature  of  the  Poisson  distribution  is  that  its  variance  equals  its  mean.  Sample 
counts  vary  more  when  their  mean  is  higher.  When  the  mean  number  of  daily  fatal  accidents 
equals  15,  greater  variability  occurs  from  day  to  day  than  when  the  mean  equals  2. 

1.2.4  Overdispersion 

In  practice,  count  observations  often  exhibit  variability  exceeding  that  predicted  by  the 
binomial  or  Poisson.  This  phenomenon  is  called  overdispersion.  We  assumed  above  that 
each  person  has  the  same  probability  each  day  of  dying  in  a  fatal  auto  accident.  More 
realistically,  these  probabilities  vary  from  day  to  day  according  to  the  amount  of  road  traffic 
and  weather  conditions  and  vary  from  person  to  person  according  to  factors  such  as  the 
amount  of  time  spent  in  autos,  whether  the  person  wears  a  seat  belt,  how  much  of  the 
driving  is  at  high  speeds,  gender,  and  age.  Such  variation  causes  fatality  counts  to  display 
more  variation  than  predicted  by  the  Poisson  model. 

Suppose that $Y$ is a random variable with variance $\mathrm{var}(Y \mid \mu)$ for given $\mu$, but $\mu$ itself varies because of unmeasured factors such as those just described. Let $\theta = E(\mu)$. Then unconditionally,

$$E(Y) = E[E(Y \mid \mu)], \qquad \mathrm{var}(Y) = E[\mathrm{var}(Y \mid \mu)] + \mathrm{var}[E(Y \mid \mu)].$$

When $Y$ is conditionally Poisson (given $\mu$), then $E(Y) = E(\mu) = \theta$ and $\mathrm{var}(Y) = E(\mu) + \mathrm{var}(\mu) = \theta + \mathrm{var}(\mu) > \theta$.
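The variance inflation can be illustrated with a simple mixture. The sketch below assumes a hypothetical two-point distribution for μ (values 5 and 25, each with probability 0.5), chosen only for illustration, and computes the unconditional mean and variance of Y from the mixture pmf truncated at 200.

```python
from math import exp, lgamma, log


def poisson_pmf(y, mu):
    """Poisson probability of y, following equation (1.4)."""
    return exp(-mu + y * log(mu) - lgamma(y + 1))


# Hypothetical mixing distribution for mu (not from the text).
mu_values = [5.0, 25.0]
weights = [0.5, 0.5]

theta = sum(w * m for m, w in zip(mu_values, weights))
var_mu = sum(w * (m - theta) ** 2 for m, w in zip(mu_values, weights))

# Unconditional pmf of Y; mass beyond 200 is negligible for these means.
mix_pmf = [
    sum(w * poisson_pmf(y, m) for m, w in zip(mu_values, weights))
    for y in range(200)
]
mean_y = sum(y * p for y, p in enumerate(mix_pmf))
var_y = sum((y - mean_y) ** 2 * p for y, p in enumerate(mix_pmf))

print(mean_y, theta)           # E(Y) versus theta
print(var_y, theta + var_mu)   # var(Y) versus theta + var(mu)
```

Under these assumptions the variance of Y is several times its mean, whereas a single Poisson distribution with the same mean would have variance equal to the mean.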

Assuming  a  Poisson  distribution  for  a  count  variable  is  often  too  simplistic,  because  of 
factors  that  cause  overdispersion.  The  negative  binomial  is  a  related  distribution  for  count 
data  that  has  a  second  parameter  and  permits  the  variance  to  exceed  the  mean.  We  introduce 
it  in  Section  4.3.4. 

Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true distribution is a mixture of different binomial distributions, with the parameter varying because of unmeasured variables. To illustrate, suppose that an experiment exposes pregnant mice to a toxin and then after a week observes the number of fetuses in each mouse's litter that show signs of malformation. Let $n_i$ denote the number of fetuses in the litter for mouse $i$. The pregnant mice also vary according to other factors, such as their weight, overall health, and genetic makeup. Extra variation then occurs because of the variability from litter to litter in the probability $\pi$ of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near $n_i$, showing more dispersion than expected for binomial sampling with a single value of $\pi$. Overdispersion could also occur when $\pi$ varies among fetuses in a litter according to some distribution (Exercise 1.17). In Chapters 4, 13, and 14 we introduce methods for data that are overdispersed relative to binomial and Poisson assumptions.

1.2.5  Connection  Between  Poisson  and  Multinomial  Distributions 

For adult residents of Britain who visit France this year, let $Y_1$ = number who fly there, $Y_2$ = number who travel there by train without a car (Eurostar), $Y_3$ = number who travel there by ferry without a car, and $Y_4$ = number who take a car (by Eurotunnel Shuttle or a ferry). A Poisson model for $(Y_1, Y_2, Y_3, Y_4)$ treats these as independent Poisson random variables, with parameters $(\mu_1, \mu_2, \mu_3, \mu_4)$. The joint probability mass function for $\{Y_i\}$ is the product of the four mass functions of form (1.4). The total $n = \sum_i Y_i$ also has a Poisson distribution, with parameter $\sum_i \mu_i$.

With Poisson sampling the total count $n$ is random rather than fixed. If we assume a Poisson model but condition on $n$, $\{Y_i\}$ no longer have Poisson distributions, since each $Y_i$ cannot exceed $n$. Given $n$, $\{Y_i\}$ are also no longer independent, since the value of one affects the possible range for the others.

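For $c$ independent Poisson variates, with $E(Y_i) = \mu_i$, the conditional probability of a set of counts $\{n_i\}$ satisfying $\sum_i Y_i = n$ is

$$\begin{aligned}
P\Big[(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c) \,\Big|\, \sum_j Y_j = n\Big]
&= \frac{P(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c)}{P(\sum_j Y_j = n)} \\
&= \frac{\prod_i [\exp(-\mu_i)\mu_i^{n_i}/n_i!]}{\exp(-\sum_j \mu_j)(\sum_j \mu_j)^{n}/n!}
= \left(\frac{n!}{\prod_i n_i!}\right)\prod_i \pi_i^{n_i},
\end{aligned} \tag{1.5}$$

where $\{\pi_i = \mu_i/(\sum_j \mu_j)\}$. This is the multinomial $(n, \{\pi_i\})$ distribution, characterized by the sample size $n$ and the probabilities $\{\pi_i\}$.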

Many  categorical  data  analyses  assume  a  multinomial  distribution.  Such  analyses  usually 
have  the  same  inferential  results  as  those  of  analyses  assuming  a  Poisson  distribution, 
because  of  the  similarity  in  the  likelihood  functions. 
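The identity between the conditional Poisson probability and the multinomial probability can be confirmed numerically. The sketch below assumes hypothetical Poisson means and counts for four categories (neither is taken from the text) and compares the ratio of Poisson probabilities with the multinomial pmf (1.2) evaluated at π_i = μ_i / Σ_j μ_j.

```python
from math import exp, factorial, prod


def poisson_pmf(y, mu):
    """Poisson probability of y, following equation (1.4)."""
    return exp(-mu) * mu**y / factorial(y)


def multinom_pmf(counts, probs):
    """Multinomial probability of the count vector, following equation (1.2)."""
    coef = factorial(sum(counts))
    for nj in counts:
        coef //= factorial(nj)
    return coef * prod(pj**nj for nj, pj in zip(counts, probs))


mus = [3.0, 1.5, 0.5, 5.0]  # hypothetical Poisson means
counts = [2, 1, 0, 4]       # hypothetical observed counts
n = sum(counts)

# Joint probability of the independent Poisson counts, divided by the
# Poisson probability of the total (whose mean is the sum of the means).
joint = prod(poisson_pmf(y, m) for y, m in zip(counts, mus))
conditional = joint / poisson_pmf(n, sum(mus))

pis = [m / sum(mus) for m in mus]
multinomial = multinom_pmf(counts, pis)

print(conditional, multinomial)
```

The two printed values should agree up to floating-point rounding, whatever positive means and nonnegative counts are used.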

1.2.6  The  Chi-Squared  Distribution 

Another distribution of fundamental importance for categorical data is the chi-squared, not as a distribution for the data but rather as a sampling distribution for many statistics. Because of its importance, we summarize here a few of its properties.

The chi-squared distribution with degrees of freedom denoted by df has mean df, variance 2(df), and skewness $(8/\mathrm{df})^{1/2}$. It converges (slowly) to normality as df increases, the approximation being reasonably good when df is at least about 50.

Let $Z$ denote a standard normal random variable (mean 0, variance 1). Then $Z^2$ has a chi-squared distribution with df = 1. A chi-squared random variable with df = $\nu$ has representation $Z_1^2 + \cdots + Z_\nu^2$, where $Z_1, \ldots, Z_\nu$ are independent standard normal variables. Thus, a chi-squared statistic having df = $\nu$ has partitionings into independent chi-squared components, for example, into $\nu$ components each having df = 1. Conversely, the reproductive property states that if $X_1^2$ and $X_2^2$ are independent chi-squared random variables having degrees of freedom $\nu_1$ and $\nu_2$, then $X^2 = X_1^2 + X_2^2$ has a chi-squared distribution with df = $\nu_1 + \nu_2$.
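The representation as a sum of squared standard normal variables suggests a quick simulation check of the mean and variance. This is only a sketch: the degrees of freedom (3), the number of draws (20000), and the seed are arbitrary choices, and simulated moments match the theoretical values only approximately.

```python
import random

random.seed(1)  # arbitrary seed so that the run is reproducible


def chi2_draw(df):
    """One draw built as a sum of df squared independent standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))


df = 3            # illustrative degrees of freedom
n_draws = 20000   # illustrative simulation size
draws = [chi2_draw(df) for _ in range(n_draws)]

sample_mean = sum(draws) / n_draws
sample_var = sum((x - sample_mean) ** 2 for x in draws) / n_draws

print(sample_mean, df)      # should be near df
print(sample_var, 2 * df)   # should be near 2(df)
```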


1.3  STATISTICAL  INFERENCE  FOR  CATEGORICAL  DATA 

In practice, the probability distribution assumed for the response variable has unknown parameter values. In this section we review methods of using sample data to make inferences about the parameters. Sections 1.4 and 1.5 illustrate these methods for binomial and multinomial parameters.


1.3.1  Likelihood  Functions  and  Maximum  Likelihood  Estimation 

In  this  book  we  use  maximum  likelihood  for  parameter  estimation.  Maximum  likelihood 
estimators  have  desirable  properties:  They  have  large-sample  normal  distributions;  they 
are  asymptotically  consistent,  converging  to  the  parameter  as  n  increases;  and  they  are 
asymptotically  efficient,  producing  large-sample  standard  errors  no  greater  than  those  from 
other  estimation  methods.  These  results  hold  under  weak  regularity  conditions,  mainly  that 
the  number  of  parameters  remains  constant  as  n  increases  and  that  the  true  values  of  those 
parameters  fall  in  the  interior  (rather  than  on  the  boundary)  of  the  parameter  space. 

Given the data, for a chosen probability distribution the likelihood function is the probability of those data, treated as a function of the unknown parameter. The maximum likelihood (ML) estimate is the parameter value that maximizes this function. This is the parameter value under which the data observed have the highest probability of occurrence. We denote a parameter for a generic problem by $\beta$ and its ML estimate by $\hat\beta$. We denote the likelihood function by $\ell(\beta)$. The $\beta$ value that maximizes $\ell(\beta)$ also maximizes $L(\beta) = \log[\ell(\beta)]$. It is simpler to maximize $L(\beta)$ since it is a sum rather than a product of terms. For many models, $L(\beta)$ has concave shape and $\hat\beta$ is the point at which the derivative equals 0. The ML estimate is then the solution of the likelihood equation, $\partial L(\beta)/\partial\beta = 0$. Often, $\beta$ is multidimensional, denoted by $\boldsymbol{\beta}$, and $\hat{\boldsymbol{\beta}}$ is the solution of a set of likelihood equations.

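Let $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ denote the asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$. Under regularity conditions (Rao 1973, p. 364), $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ is the inverse of the information matrix. The $(j, k)$ element of the information matrix is

$$-E\left(\frac{\partial^2 L(\boldsymbol{\beta})}{\partial\beta_j\,\partial\beta_k}\right). \tag{1.6}$$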


The standard errors are the square roots of the diagonal elements for the inverse of the information matrix. The greater the curvature of the log likelihood function, the smaller the standard errors. This is reasonable, since large curvature implies that the log likelihood drops quickly as $\beta$ moves away from $\hat\beta$; hence, the data would have been much more likely to occur if $\beta$ took a value near $\hat\beta$ rather than a value far from $\hat\beta$.


1.3.2  Likelihood  Function  and  ML  Estimate  for  Binomial  Parameter 

The  part  of  a  likelihood  function  involving  the  parameters  is  called  the  kernel.  Since  the 
maximization  of  the  likelihood  is  done  with  respect  to  the  parameters,  the  rest  is  irrelevant. 

To illustrate, consider the binomial distribution (1.1). The binomial coefficient $n!/[y!(n-y)!]$ has no influence on where the maximum occurs with respect to $\pi$. Thus, we ignore it and treat the kernel as the likelihood function. The binomial log likelihood function is then

$$L(\pi) = \log[\pi^{y}(1-\pi)^{n-y}] = y\log(\pi) + (n-y)\log(1-\pi). \tag{1.7}$$

Differentiating with respect to $\pi$ yields

$$\partial L(\pi)/\partial\pi = y/\pi - (n-y)/(1-\pi) = (y - n\pi)/[\pi(1-\pi)]. \tag{1.8}$$

Equating this to 0 gives the likelihood equation, which has solution $\hat\pi = y/n$, the sample proportion of successes for the $n$ trials.

Calculating $\partial^2 L(\pi)/\partial\pi^2$, taking the expectation, and combining terms, we get

$$-E[\partial^2 L(\pi)/\partial\pi^2] = E[y/\pi^2 + (n-y)/(1-\pi)^2] = n/[\pi(1-\pi)]. \tag{1.9}$$

Thus, the asymptotic variance of $\hat\pi$ is $\pi(1-\pi)/n$. This is no surprise. Since $E(Y) = n\pi$ and $\mathrm{var}(Y) = n\pi(1-\pi)$, the distribution of $\hat\pi = Y/n$ has mean and standard deviation

$$E(\hat\pi) = \pi, \qquad \sigma(\hat\pi) = \sqrt{\frac{\pi(1-\pi)}{n}}.$$
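The derivation can be checked numerically by maximizing the log likelihood on a grid and measuring its curvature at the maximum. This is a minimal sketch, assuming hypothetical data of y = 6 successes in n = 20 trials; the grid spacing and the finite-difference step `h` are arbitrary choices.

```python
from math import log, sqrt


def loglik(pi, y, n):
    """Binomial log-likelihood kernel, following equation (1.7)."""
    return y * log(pi) + (n - y) * log(1 - pi)


y, n = 6, 20  # hypothetical data, not from the text

# Grid search over (0, 1); the maximizer should sit at y/n.
grid = [k / 1000 for k in range(1, 1000)]
pi_hat_grid = max(grid, key=lambda p: loglik(p, y, n))
pi_hat = y / n

# Curvature at the maximum via a central second difference. At pi_hat the
# observed curvature equals the expected information n / [pi(1 - pi)].
h = 1e-4
curvature = (
    loglik(pi_hat + h, y, n) - 2 * loglik(pi_hat, y, n) + loglik(pi_hat - h, y, n)
) / h**2
se_numeric = 1 / sqrt(-curvature)
se_formula = sqrt(pi_hat * (1 - pi_hat) / n)

print(pi_hat_grid, pi_hat)
print(se_numeric, se_formula)
```

A sharper peak (larger n with the same proportion) gives a larger curvature and hence a smaller standard error, as described in Section 1.3.1.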


1.3.3  Wald-Likelihood  Ratio-Score  Test  Triad 

There are three standard ways to use the likelihood function to perform large-sample inference. We introduce these for a significance test of a null hypothesis $H_0$: $\beta = \beta_0$ and then discuss their relation to interval estimation. They all exploit the large-sample normality of ML estimators.

Standard errors obtained from the inverse of the information matrix depend on the unknown parameter values. When we substitute the unrestricted ML estimates (i.e., not assuming the null hypothesis) we obtain an estimated standard error of $\hat\beta$, which we denote by SE. Denote $-E[\partial^2 L(\beta)/\partial\beta^2]$ (i.e., the information) evaluated at $\hat\beta$ by $\iota(\hat\beta)$. The first large-sample inference method has test statistic using this estimated standard error,

$$z = (\hat\beta - \beta_0)/SE, \quad \text{where } SE = 1/\sqrt{\iota(\hat\beta)}.$$

This statistic has an approximate standard normal distribution when $\beta = \beta_0$. We refer $z$ to the standard normal table to obtain one- or two-sided $P$-values. Equivalently, for the two-sided alternative, $z^2$ has an approximate chi-squared null distribution with df = 1; the $P$-value is then the right-tailed chi-squared probability above the observed value. This type of statistic, using the nonnull estimated standard error, is called a Wald statistic (Wald 1943).

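The multivariate extension² for the Wald test of $H_0$: $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ has test statistic

$$W = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^{T}[\mathrm{cov}(\hat{\boldsymbol{\beta}})]^{-1}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$

The nonnull covariance is based on the curvature (1.6) of the log-likelihood function at $\hat{\boldsymbol{\beta}}$ and typically itself requires estimation. The asymptotic multivariate normal distribution for $\hat{\boldsymbol{\beta}}$ implies an asymptotic chi-squared distribution for $W$. The df equal the rank of $\mathrm{cov}(\hat{\boldsymbol{\beta}})$, which is the number of nonredundant parameters in $\boldsymbol{\beta}$.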


²The $T$ superscript on a vector or matrix denotes the transpose.

A second general-purpose method uses the likelihood function through the ratio of two maximizations: (1) the maximum over the possible parameter values under $H_0$, and (2) the maximum over the larger set of parameter values permitting $H_0$ or an alternative $H_a$ to be true. Let $\ell_0$ denote the maximized value of the likelihood function under $H_0$, and let $\ell_1$ denote the maximized value generally (i.e., under $H_0 \cup H_a$). For instance, for parameters $\boldsymbol{\beta} = (\beta_0, \beta_1)$ and $H_0$: $\beta_0 = 0$, $\ell_1$ is the likelihood function calculated at the $\boldsymbol{\beta}$ value for which the data would have been most likely; $\ell_0$ is the likelihood function calculated at the $\beta_1$ value for which the data would have been most likely, when $\beta_0 = 0$. Then $\ell_1$ is always at least as large as $\ell_0$, since $\ell_0$ results from maximizing over a restricted set of the parameter values.

The ratio $\Lambda = \ell_0/\ell_1$ of the maximized likelihoods cannot exceed 1. Wilks (1935, 1938) showed that $-2\log\Lambda$ has a limiting null chi-squared distribution, as $n \to \infty$. The df equal the difference in the dimensions of the parameter spaces under $H_0 \cup H_a$ and under $H_0$. The likelihood-ratio test statistic equals

$$-2\log\Lambda = -2\log(\ell_0/\ell_1) = -2(L_0 - L_1),$$

where $L_0$ and $L_1$ denote the maximized log-likelihood functions. [In this book, we use the natural logarithm throughout, for which its inverse is the exponential function; so, if $a = \log(b)$, then $b = \exp(a) = e^{a}$.]

The third method uses the score statistic, due to R. A. Fisher and C. R. Rao. The score test, referred to in some literature as the Lagrange multiplier test, is based on the slope and expected curvature of the log-likelihood function $L(\beta)$ at the null value $\beta_0$. It utilizes the size of the score function

$$u(\beta) = \partial L(\beta)/\partial\beta,$$

evaluated at $\beta_0$. The value $u(\beta_0)$ tends to be larger in absolute value when $\hat\beta$ is farther from $\beta_0$. Denote $-E[\partial^2 L(\beta)/\partial\beta^2]$ evaluated at $\beta_0$ by $\iota(\beta_0)$. The score statistic is the ratio of $u(\beta_0)$ to its null SE, which is $[\iota(\beta_0)]^{1/2}$. This has an approximate standard normal null distribution. The chi-squared form of the score statistic is

$$\frac{[u(\beta_0)]^2}{\iota(\beta_0)} = \frac{[\partial L(\beta)/\partial\beta_0]^2}{-E[\partial^2 L(\beta)/\partial\beta_0^2]},$$

where the notation reflects derivatives with respect to $\beta$ that are evaluated at $\beta_0$. In the multiparameter case, the score statistic is a quadratic form based on the vector of partial derivatives of the log likelihood with respect to $\boldsymbol{\beta}$ and the inverse information matrix, both evaluated at the $H_0$ estimates (i.e., assuming that $\boldsymbol{\beta} = \boldsymbol{\beta}_0$).

Figure 1.1 shows a plot of a generic log-likelihood function $L(\beta)$ for the univariate case. It illustrates the three tests of $H_0$: $\beta = 0$. The Wald test uses the behavior of $L(\beta)$ at the ML estimate $\hat\beta$, having chi-squared form $(\hat\beta/SE)^2$. The SE of $\hat\beta$ depends on the curvature of $L(\beta)$ at $\hat\beta$. The score test is based on the slope and curvature of $L(\beta)$ at $\beta = 0$. The likelihood-ratio test combines information about $L(\beta)$ at both $\hat\beta$ and $\beta_0 = 0$. It compares the log-likelihood values $L_1$ at $\hat\beta$ and $L_0$ at $\beta_0 = 0$ using the chi-squared statistic $-2(L_0 - L_1)$. In Figure 1.1, this statistic is twice the vertical distance between values of $L(\beta)$ at $\hat\beta$ and at 0.

Figure 1.1 Log-likelihood function and information used in three tests of $H_0$: $\beta = 0$.


Section 1.4.1 illustrates the Wald, likelihood-ratio, and score tests for inference about a binomial parameter. As $n \to \infty$, the three tests have certain asymptotic equivalences (Cox and Hinkley 1974, Sec. 9.3). For small to moderate sample sizes, the likelihood-ratio and score tests are usually more reliable than the Wald test, having actual error rates closer to the nominal level.

1.3.4  Constructing  Confidence  Intervals  by  Inverting  Tests 

In practice, it is more informative to construct confidence intervals for parameters than to test hypotheses about their values. For any of the three test methods, we can construct a confidence interval by inverting the test. For instance, a 95% confidence interval for $\beta$ is the set of $\beta_0$ for which the test of $H_0$: $\beta = \beta_0$ has $P$-value exceeding 0.05.

Let $z_a$ denote the $z$-score from the standard normal distribution having right-tailed probability $a$; this is the $100(1-a)$ percentile of that distribution. A $100(1-\alpha)\%$ confidence interval based on asymptotic normality uses $z_{\alpha/2}$, for instance $z_{0.025} = 1.96$ for 95% confidence. The Wald confidence interval is the set of $\beta_0$ for which $|\hat\beta - \beta_0|/SE < z_{\alpha/2}$. This gives the interval $\hat\beta \pm z_{\alpha/2}(SE)$. Let $\chi^2_{df}(a)$ denote the $100(1-a)$ percentile of the chi-squared distribution with degrees of freedom df. The likelihood-ratio-based confidence interval is the set of $\beta_0$ for which $-2[L(\beta_0) - L(\hat\beta)] < \chi^2_1(\alpha)$. [Note that $\chi^2_1(\alpha) = z^2_{\alpha/2}$.]
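Inverting a test can be carried out numerically. The sketch below does this for a binomial parameter, assuming hypothetical data (y = 6 successes in n = 20 trials) with 0 < y < n; the bisection routine and its tolerance are illustrative choices, not a method prescribed by the text. The critical value uses the relation $\chi^2_1(\alpha) = z^2_{\alpha/2}$ with $z_{0.025} = 1.96$.

```python
from math import log, sqrt

Z_975 = 1.96            # z_{0.025}, as quoted in the text
CHI2_1_95 = Z_975**2    # chi-squared (df = 1) critical value, equal to z^2_{0.025}


def loglik(pi, y, n):
    """Binomial log-likelihood kernel, following equation (1.7)."""
    return y * log(pi) + (n - y) * log(1 - pi)


def wald_ci(y, n, z=Z_975):
    """Wald interval: estimate plus or minus z times the estimated SE."""
    p = y / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


def lr_ci(y, n, crit=CHI2_1_95, tol=1e-10):
    """Likelihood-ratio interval: all pi0 with -2[L(pi0) - L(pi_hat)] < crit."""
    p_hat = y / n

    def excess(p0):
        return -2 * (loglik(p0, y, n) - loglik(p_hat, y, n)) - crit

    def bisect(lo, hi):
        # excess(lo) and excess(hi) have opposite signs on entry.
        sign_lo = excess(lo) > 0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if (excess(mid) > 0) == sign_lo:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    eps = 1e-12
    return bisect(eps, p_hat), bisect(p_hat, 1 - eps)


y, n = 6, 20  # hypothetical data, not from the text
print(wald_ci(y, n))
print(lr_ci(y, n))
```

The Wald interval is symmetric about the estimate by construction, while the likelihood-ratio interval follows the shape of the log likelihood and need not be symmetric.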

When $\hat\beta$ has a normal distribution, the log-likelihood function has a parabolic shape. For small samples with categorical data, $\hat\beta$ may be far from normality and the log-likelihood function can be far from a symmetric, parabolic-shaped curve. This can also happen with moderate to large samples when $\beta$ falls near the boundary of the parameter space, such as a population proportion that is near 0 or near 1. In such cases, inference based on asymptotic normality of $\hat\beta$ may have inadequate performance. A marked divergence in results of Wald and likelihood-ratio inference indicates that the distribution of $\hat\beta$ may not be close to normality. The example in Section 1.4.3 illustrates.

The Wald confidence interval is commonly used in practice, because it is simple to construct using ML estimates and standard errors reported by statistical software. The likelihood-ratio-test-based interval is becoming more widely available in software and is preferable for categorical data with small to moderate $n$. The score-test-based interval is widely available only in certain cases, such as for proportions as outlined in Section 1.4.2. For the best known statistical model, regression for a normal response, the three types of inference provide identical results. In later chapters, we'll use versions of these intervals that apply for models with multiple parameters. Especially useful is the profile likelihood approach based on inverting likelihood-ratio tests (e.g., Section 3.2.6).


1.4  STATISTICAL  INFERENCE  FOR  BINOMIAL  PARAMETERS 

In this section we illustrate inference methods for categorical data by presenting tests and confidence intervals for the binomial parameter $\pi$. With $y$ successes in $n$ independent trials, recall that the ML estimator of $\pi$ is $\hat\pi = y/n$, for which $E(\hat\pi) = \pi$ and $\mathrm{var}(\hat\pi) = \pi(1-\pi)/n$.


1.4.1  Tests  About  a  Binomial  Parameter 

Consider $H_0$: $\pi = \pi_0$. Since $H_0$ has a single parameter, we use the normal rather than chi-squared forms of Wald and score test statistics. They permit tests against one-sided as well as two-sided alternatives.

The Wald statistic for testing $H_0$: $\pi = \pi_0$ is

$$z_W = \frac{\hat\pi - \pi_0}{SE} = \frac{\hat\pi - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}}. \tag{1.10}$$


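To find the score statistic, we evaluate the binomial score (1.8) and information (1.9) at $\pi_0$.
This yields

$$u(\pi_0) = \frac{y}{\pi_0} - \frac{n - y}{1 - \pi_0}, \qquad \iota(\pi_0) = \frac{n}{\pi_0(1 - \pi_0)}.$$

The normal form of the score statistic simplifies to

$$z_S = \frac{u(\pi_0)}{[\iota(\pi_0)]^{1/2}} = \frac{y - n\pi_0}{\sqrt{n\pi_0(1 - \pi_0)}} = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}}. \tag{1.11}$$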


Whereas the Wald statistic $z_W$ uses the standard error evaluated at $\hat{\pi}$, the score statistic $z_S$
uses it evaluated at $\pi_0$. The score statistic is preferable, as it uses the actual null SE rather
than an estimate. Its null sampling distribution is closer to standard normal than that of the
Wald statistic.

The binomial log-likelihood function (1.7) equals $L_0 = y \log \pi_0 + (n - y) \log(1 - \pi_0)$
under $H_0$ and $L_1 = y \log \hat{\pi} + (n - y) \log(1 - \hat{\pi})$ more generally. The likelihood-ratio test
statistic simplifies to

$$-2(L_0 - L_1) = 2\left[ y \log \frac{\hat{\pi}}{\pi_0} + (n - y) \log \frac{1 - \hat{\pi}}{1 - \pi_0} \right].$$

Expressed as

$$-2(L_0 - L_1) = 2\left[ y \log \frac{y}{n\pi_0} + (n - y) \log \frac{n - y}{n - n\pi_0} \right],$$

it compares observed success and failure counts with fitted counts under $H_0$ by

$$2 \sum \text{observed} \, \log \frac{\text{observed}}{\text{fitted}}. \tag{1.12}$$


We’ll see that this formula also holds for tests about Poisson and multinomial parameters.
Since no unknown parameters occur under $H_0$ and one occurs under $H_a$, the asymptotic
chi-squared distribution for (1.12) has df $= 1 - 0 = 1$.
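
The three statistics of this section are easy to compute directly. The following Python sketch is an illustration added here, not part of the original development: the function name `binomial_tests` and the data in the usage lines (y = 7 successes in n = 25 trials, with $\pi_0 = 0.50$) are assumptions chosen only for demonstration. The P-values come from the standard normal and chi-squared (df = 1) reference distributions, evaluated with `math.erfc`.

```python
import math


def binomial_tests(y, n, pi0):
    """Wald, score, and likelihood-ratio statistics for H0: pi = pi0.

    Returns a dict with each statistic and its two-sided large-sample
    P-value. The Wald statistic is undefined (nan) when y = 0 or y = n,
    because its estimated standard error is then 0.
    """
    pi_hat = y / n
    se_hat = math.sqrt(pi_hat * (1 - pi_hat) / n)   # SE evaluated at pi_hat
    se_null = math.sqrt(pi0 * (1 - pi0) / n)        # SE evaluated at pi0
    z_wald = (pi_hat - pi0) / se_hat if se_hat > 0 else float("nan")
    z_score = (pi_hat - pi0) / se_null

    def term(observed, fitted):
        # observed * log(observed / fitted), taken as 0 when observed = 0
        return 0.0 if observed == 0 else observed * math.log(observed / fitted)

    # Likelihood-ratio statistic in the observed/fitted form (1.12)
    lr = max(2 * (term(y, n * pi0) + term(n - y, n * (1 - pi0))), 0.0)

    def two_sided(z):
        return math.erfc(abs(z) / math.sqrt(2))     # P(|Z| >= |z|)

    return {
        "z_wald": z_wald, "p_wald": two_sided(z_wald),
        "z_score": z_score, "p_score": two_sided(z_score),
        "lr": lr, "p_lr": math.erfc(math.sqrt(lr / 2)),  # P(chi2_1 >= lr)
    }


# Illustrative (assumed) data: 7 successes in 25 trials, H0: pi = 0.50
for name, value in binomial_tests(7, 25, 0.50).items():
    print(f"{name}: {value:.4f}")
```

Because the Wald statistic divides by the standard error evaluated at $\hat{\pi}$, it is larger in absolute value than the score statistic whenever $\hat{\pi}$ is farther from 0.50 than $\pi_0$ is.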


1.4.2  Confidence  Intervals  for  a  Binomial  Parameter 

Inverting the Wald test statistic gives the interval of $\pi_0$ values for which $|z_W| < z_{\alpha/2}$, or

$$\hat{\pi} \pm z_{\alpha/2} \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}. \tag{1.13}$$


Historically, this was one of the first confidence intervals used for any parameter (Laplace
1812, p. 283). Unfortunately, it performs poorly unless $n$ is very large (e.g., Brown et al.
2001), in the sense that the actual probability that the interval contains $\pi$ usually falls below
the nominal confidence coefficient, much below when $\pi$ is near 0 or 1.

The likelihood-ratio-based confidence interval is more complex computationally, but
simple in principle. It is the set of $\pi_0$ for which the likelihood-ratio test has a P-value
exceeding $\alpha$. Equivalently, it is the set of $\pi_0$ for which double the log likelihood drops by
less than $\chi_1^2(\alpha)$ from its value at the ML estimate $\hat{\pi} = y/n$. For example, the endpoints of
the 95% confidence interval can be found using numerical methods to iteratively solve for
the values of $\pi_0$ that satisfy

$$2\left[ y \log \frac{\hat{\pi}}{\pi_0} + (n - y) \log \frac{1 - \hat{\pi}}{1 - \pi_0} \right] = \chi_1^2(0.05) = 3.84.$$


The score confidence interval contains $\pi_0$ values for which $|z_S| < z_{\alpha/2}$. Its endpoints are
the $\pi_0$ solutions to the equations

$$(\hat{\pi} - \pi_0) / \sqrt{\pi_0(1 - \pi_0)/n} = \pm z_{\alpha/2}.$$


These are quadratic in $\pi_0$. First discussed by Wilson (1927), this interval is

$$\hat{\pi}\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right) \pm z_{\alpha/2} \sqrt{\frac{1}{n + z_{\alpha/2}^2}\left[\hat{\pi}(1 - \hat{\pi})\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)\right]}. \tag{1.14}$$

The midpoint is a weighted average of $\hat{\pi}$ and $\frac{1}{2}$, where the weight $n/(n + z_{\alpha/2}^2)$ given $\hat{\pi}$
increases as $n$ increases. Combining terms, this midpoint equals
$\tilde{\pi} = (y + z_{\alpha/2}^2/2)/(n + z_{\alpha/2}^2)$. This is the sample proportion for an adjusted sample that adds
$z_{\alpha/2}^2$ observations, half of each type, for example, $z_{0.025}^2/2 = 1.96^2/2 \approx 2$ of each type for 95% intervals.

The square of the coefficient of $z_{\alpha/2}$ in (1.14) is a weighted average of the variance of a
sample proportion when $\pi = \hat{\pi}$ and the variance of a sample proportion when $\pi = \frac{1}{2}$, using
the adjusted sample size $n + z_{\alpha/2}^2$ in place of $n$.

For  95%  confidence,  the  score  interval  can  be  approximated  by  a  simple  adjustment  of 
the Wald interval (see Exercise 1.25) that adds 2 observations of each type to the sample
before  using  the  Wald  formula  (1.13).  This  interval  and  the  ordinary  score  interval  tend  to 
have  actual  coverage  probability  much  closer  to  the  nominal  level  than  the  Wald  interval 
(Agresti  and  Coull  1998,  Agresti  and  Caffo  2000). 
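
As a rough sketch of how these intervals might be computed (an illustration added here, with function names of our own choosing), the Python code below implements the Wald interval (1.13), the score interval written exactly in the weighted-average form (1.14), and the 95% adjustment that adds 2 observations of each type before applying the Wald formula. The value 1.96 for $z_{0.025}$ follows the text; the data in the usage lines are assumed for demonstration.

```python
import math

Z95 = 1.96  # z_{0.025}, the value used in the text for 95% intervals


def wald_interval(y, n, z=Z95):
    """Wald interval (1.13): pi_hat +/- z * sqrt(pi_hat (1 - pi_hat) / n)."""
    p = y / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def score_interval(y, n, z=Z95):
    """Score (Wilson) interval in the weighted-average form (1.14)."""
    p = y / n
    w = n / (n + z**2)                    # weight given to pi_hat
    mid = p * w + 0.5 * (1 - w)           # weighted average of pi_hat and 1/2
    half = z * math.sqrt((p * (1 - p) * w + 0.25 * (1 - w)) / (n + z**2))
    return mid - half, mid + half


def adjusted_wald_interval(y, n):
    """95% only: add 2 successes and 2 failures, then use the Wald formula."""
    return wald_interval(y + 2, n + 4)


# Illustrative (assumed) data: 7 successes in 25 trials
for f in (wald_interval, score_interval, adjusted_wald_interval):
    lo, hi = f(7, 25)
    print(f"{f.__name__}: ({lo:.3f}, {hi:.3f})")
```

Neither Wald-type function truncates its endpoints to [0, 1], so for proportions near the boundary the reported limits can fall outside the parameter space.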

1.4.3  Example:  Estimating  the  Proportion  of  Vegetarians 

To  collect  data  to  illustrate  concepts  in  introductory  statistics  courses,  often  I  have  given  the 
students  a  questionnaire.  One  year  I  asked  each  student  in  an  honors  class  at  the  University 
of Florida whether he or she was a vegetarian. Of $n = 25$ students, $y = 0$ answered “yes.”
They were not a random sample of a particular population, but we use these data to illustrate
95% confidence intervals for a binomial parameter $\pi$.

Since $y = 0$, the ML estimate $\hat{\pi} = 0/25 = 0$. With the Wald method, the 95% confidence
interval for $\pi$ is

$$\hat{\pi} \pm 1.96\sqrt{\hat{\pi}(1 - \hat{\pi})/n}, \quad \text{which is} \quad 0 \pm 1.96\sqrt{(0.0 \times 1.0)/25}, \quad \text{or } (0, 0).$$

When  a  parameter  falls  near  the  boundary  of  the  sample  space,  often  sample  estimates  of 
standard  errors  are  poor  and  the  Wald  method  does  not  provide  a  sensible  answer. 

By contrast, the 95% score interval equals (0.0, 0.133). That is, when $\hat{\pi} = 0.0$ and
$n = 25$, the two roots for $\pi_0$ that satisfy the equation

$$|\hat{\pi} - \pi_0| = 1.96\sqrt{\pi_0(1 - \pi_0)/n}$$

are $\pi_0 = 0.0$ and $\pi_0 = 0.133$. This interval provides a more believable inference. It
contains the values not rejected in corresponding score tests with size (probability of
type I error) 0.05. For $H_0$: $\pi = 0.20$, for instance, the score test statistic is
$z_S = (0 - 0.20)/\sqrt{(0.20 \times 0.80)/25} = -2.50$, which has two-sided P-value $0.012 < 0.05$, so 0.20
does not fall in the interval. By contrast, for $H_0$: $\pi = 0.10$,
$z_S = (0 - 0.10)/\sqrt{(0.10 \times 0.90)/25} = -1.67$, which has P-value $0.096 > 0.05$, so 0.10 falls in the
interval.

When $y = 0$ and $n = 25$, the kernel of the likelihood function is $\ell(\pi) = \pi^0(1 - \pi)^{25} = (1 - \pi)^{25}$.
The log-likelihood function (1.7) is $L(\pi) = 25 \log(1 - \pi)$. Note that $L(\hat{\pi}) = L(0) = 0$.
The 95% likelihood-ratio confidence interval is the set of $\pi_0$ for which the
likelihood-ratio statistic

$$-2(L_0 - L_1) = -2[L(\pi_0) - L(\hat{\pi})] = -50 \log(1 - \pi_0) \le \chi_1^2(0.05) = 3.84.$$

[Figure 1.2 Binomial likelihood and log likelihood when $y = 0$ in $n = 25$ trials, and likelihood-ratio test-based
confidence interval for $\pi$.]


The upper bound is $1 - \exp(-3.84/50) = 0.074$, and the confidence interval equals (0.0,
0.074). Figure 1.2 shows the likelihood and log-likelihood functions and the corresponding
confidence region for $\pi$.

The three large-sample methods yield quite different results. When $\pi$ is near 0, the
sampling distribution of $\hat{\pi}$ is highly skewed to the right for small $n$. From numerical
evaluations, we prefer the interval based on inverting the score test.
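
For data other than y = 0, the likelihood-ratio interval has no closed form, and its endpoints must be found numerically. The sketch below is one simple way to do so, using bisection; it is an illustration added here, and the helper names (`lr_stat`, `bisect`, `lr_interval`) are our own. It uses the critical value 3.84 quoted in the text and, for y = 0 and n = 25, should recover the upper bound $1 - \exp(-3.84/50)$.

```python
import math

CHI2_1_05 = 3.84  # chi-squared (df = 1) critical value used in the text


def lr_stat(y, n, pi0):
    """Likelihood-ratio statistic -2(L0 - L1) for H0: pi = pi0."""
    def term(observed, fitted):
        return 0.0 if observed == 0 else observed * math.log(observed / fitted)
    return 2 * (term(y, n * pi0) + term(n - y, n * (1 - pi0)))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid     # root lies in the upper half
        else:
            hi = mid                  # root lies in the lower half
    return (lo + hi) / 2


def lr_interval(y, n, crit=CHI2_1_05):
    """Set of pi0 with likelihood-ratio statistic no greater than crit."""
    p_hat = y / n
    g = lambda pi0: lr_stat(y, n, pi0) - crit
    eps = 1e-12                       # keep pi0 strictly inside (0, 1)
    lower = 0.0 if y == 0 else bisect(g, eps, p_hat)
    upper = 1.0 if y == n else bisect(g, p_hat, 1 - eps)
    return lower, upper


lo, hi = lr_interval(0, 25)
print(f"95% likelihood-ratio interval: ({lo:.3f}, {hi:.3f})")
```

Bisection is chosen here only for transparency; the statistic is monotone on each side of $\hat{\pi}$, so any bracketing root finder would serve.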

1.4.4 Exact Small-Sample Inference and the Mid P-Value

With modern computational power, it is not necessary to rely on large-sample approximations
for the distribution of estimators such as $\hat{\pi}$. Tests and confidence intervals can directly
use the binomial distribution rather than its normal approximation. Such inferences occur
naturally for small samples, but apply for any $n$.

We illustrate by testing $H_0$: $\pi = 0.50$ against $H_a$: $\pi \ne 0.50$ for the survey results
on vegetarianism just discussed, namely, $y = 0$ with $n = 25$. We noted that the score
statistic equals $z = -5.0$. The exact P-value for this statistic, based on the null bin(25, 0.50)
distribution, is

$$P(|z| \ge 5.0) = P(Y = 0 \text{ or } Y = 25) = 0.50^{25} + 0.50^{25} = 0.00000006.$$

Because of discreteness, in testing $H_0$: $\pi = \pi_0$, it is not usually possible to achieve a
particular fixed size such as 0.05. With a finite number of possible samples, there is a finite
number of possible P-values, of which 0.05 may not be one. When $n = 25$ and $\pi_0 = 0.50$,
for example, the two-sided P-value using the binomial probabilities is 0.043 if $y = 7$ or if
$y = 18$ and it is 0.108 if $y = 8$ or if $y = 17$. Thus, if we reject $H_0$ when $y \le 7$ or $y \ge 18$,
the test is conservative, in the sense that the actual size (i.e., 0.043) is less than the nominal
size (0.05).

To  adjust  somewhat  for  discreteness  in  small-sample  distributions,  we  can  base  inference 
on the mid P-value (Lancaster 1949b, 1961). For a test statistic $T$ with observed value $t_o$
and one-sided $H_a$ such that large $T$ contradicts $H_0$,

$$\text{mid P-value} = \tfrac{1}{2} P(T = t_o) + P(T > t_o),$$

with  probabilities  calculated  from  the  null  distribution.  Thus,  the  mid  P-value  is  less  than 
the  ordinary  P-value  by  half  the  probability  of  the  observed  result.  Although  discrete, 
compared  with  the  ordinary  P-value,  the  mid  P-value  behaves  more  like  the  P-value  for  a 
test statistic having a continuous distribution: The sum of its two one-sided P-values equals
1.0. Under $H_0$, it has a null expected value of 0.50 (like the uniform distribution that occurs
in  the  continuous  case),  whereas  this  expected  value  exceeds  0.50  for  the  ordinary  P-value 
for  a  discrete  test  statistic. 

Unlike  an  exact  test  with  ordinary  P-value,  a  test  using  the  mid  P-value  does  not  guarantee 
that  the  size  of  the  test  is  no  greater  than  a  nominal  value  (Exercise  1.12).  However,  it  usually 
performs  well.  It  is  less  conservative  than  the  ordinary  exact  test.  Inference  based  on  the 
mid  P-value  compromises  between  the  conservativeness  of  exact  methods  and  the  uncertain 
adequacy  of  large-sample  methods. 
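
The exact and mid P-values discussed in this section require only binomial probabilities. The following Python sketch is an illustration added here (the function names are our own); it sums null binomial probabilities over outcomes at least as extreme as the one observed, and it should reproduce the two-sided values 0.043 and 0.108 quoted above for n = 25 and $\pi_0 = 0.50$.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability P(Y = k) for n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def exact_two_sided_p(y, n, pi0):
    """Null probability of outcomes whose score statistic is at least as far
    from 0 as the observed one, i.e. |k - n*pi0| >= |y - n*pi0|."""
    cutoff = abs(y - n * pi0) - 1e-9          # small tolerance for rounding
    return sum(binom_pmf(k, n, pi0) for k in range(n + 1)
               if abs(k - n * pi0) >= cutoff)


def upper_tail_p(y, n, pi0, mid=False):
    """One-sided P(Y >= y); with mid=True, only half of P(Y = y) is counted."""
    tail = sum(binom_pmf(k, n, pi0) for k in range(y + 1, n + 1))
    return tail + (0.5 if mid else 1.0) * binom_pmf(y, n, pi0)


def lower_tail_p(y, n, pi0, mid=False):
    """One-sided P(Y <= y); with mid=True, only half of P(Y = y) is counted."""
    tail = sum(binom_pmf(k, n, pi0) for k in range(y))
    return tail + (0.5 if mid else 1.0) * binom_pmf(y, n, pi0)


print(exact_two_sided_p(0, 25, 0.50))   # vegetarian example, H0: pi = 0.50
print(exact_two_sided_p(7, 25, 0.50), exact_two_sided_p(8, 25, 0.50))
# The two one-sided mid P-values sum to 1; the ordinary ones sum to more than 1
print(upper_tail_p(7, 25, 0.50, mid=True) + lower_tail_p(7, 25, 0.50, mid=True))
print(upper_tail_p(7, 25, 0.50) + lower_tail_p(7, 25, 0.50))
```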

Similarly, we can use small-sample distributions to construct confidence intervals for
parameters. Some subtle issues arise such that the choice of such an interval is not
straightforward, and we defer this topic to a special section (16.6) in Chapter 16 about small-sample
intervals for categorical data.


1.5  STATISTICAL  INFERENCE  FOR  MULTINOMIAL  PARAMETERS 

Next we consider inference for multinomial parameters $\{\pi_j\}$. Of $n$ observations in $c$
categories, $n_j$ occur in category $j$, $j = 1, \ldots, c$.

1.5.1  Estimation  of  Multinomial  Parameters 

First, we obtain ML estimates of $\{\pi_j\}$. As a function of $\{\pi_j\}$, the multinomial probability
mass function (1.2) is proportional to the kernel

$$\prod_j \pi_j^{n_j}, \quad \text{where all } \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1. \tag{1.15}$$

The ML estimates are the $\{\pi_j\}$ that maximize (1.15).

The multinomial log-likelihood function is

$$L(\boldsymbol{\pi}) = \sum_j n_j \log \pi_j.$$

To eliminate redundancies, we treat $L$ as a function of $(\pi_1, \ldots, \pi_{c-1})$, since
$\pi_c = 1 - (\pi_1 + \cdots + \pi_{c-1})$. Thus, $\partial \pi_c / \partial \pi_j = -1$, $j = 1, \ldots, c - 1$. Since

$$\frac{\partial \log \pi_c}{\partial \pi_j} = \frac{1}{\pi_c} \frac{\partial \pi_c}{\partial \pi_j} = -\frac{1}{\pi_c},$$

differentiating $L(\boldsymbol{\pi})$ with respect to $\pi_j$ gives the likelihood equation

$$\frac{\partial L(\boldsymbol{\pi})}{\partial \pi_j} = \frac{n_j}{\pi_j} - \frac{n_c}{\pi_c} = 0.$$

The ML solution satisfies $\hat{\pi}_j / \hat{\pi}_c = n_j / n_c$. Now

$$\sum_j \hat{\pi}_j = 1 = \frac{\hat{\pi}_c \left( \sum_j n_j \right)}{n_c} = \frac{\hat{\pi}_c n}{n_c},$$

so $\hat{\pi}_c = n_c/n$ and then $\hat{\pi}_j = n_j/n$. From general results presented later in the book
(Section 9.6), this solution does maximize the likelihood. Thus, the ML estimates of $\{\pi_j\}$ are
the sample proportions.


1.5.2  Pearson  Chi-Squared  Test  of  a  Specified  Multinomial 

In  1900  the  eminent  British  statistician  Karl  Pearson  introduced  a  hypothesis  test  that 
was  one  of  the  first  inferential  methods.  It  had  a  revolutionary  impact  on  categorical  data 
analysis.  Pearson’s  test  evaluates  whether  multinomial  parameters  equal  certain  values.  His 
original  motivation  in  developing  this  test  was  to  analyze  whether  possible  outcomes  on  a 
particular  Monte  Carlo  roulette  wheel  were  equally  likely  (Stigler  1986). 

Consider $H_0$: $\pi_j = \pi_{j0}$, $j = 1, \ldots, c$, where $\sum_j \pi_{j0} = 1$. When $H_0$ is true, the expected
values of $\{n_j\}$, called expected frequencies, are $\mu_j = n\pi_{j0}$, $j = 1, \ldots, c$. Pearson proposed
the test statistic

$$X^2 = \sum_j \frac{(n_j - \mu_j)^2}{\mu_j}. \tag{1.16}$$


Greater differences $|n_j - \mu_j|$ produce greater $X^2$ values, for fixed $\{\pi_{j0}\}$ and $n$. Let $X_o^2$
denote the observed value of $X^2$. The P-value is the null value of $P(X^2 \ge X_o^2)$. This equals
the sum of the null multinomial probabilities of all count arrays (having a sum of $n$) with
$X^2 \ge X_o^2$.

For large samples, $X^2$ has approximately a chi-squared distribution with df $= c - 1$.
The P-value is approximated by $P(\chi_{c-1}^2 \ge X_o^2)$, where $\chi_{c-1}^2$ denotes a chi-squared random
variable with df $= c - 1$. Statistic (1.16) is called the Pearson chi-squared statistic.


1.5.3  Likelihood-Ratio  Chi-Squared  Test  of  a  Specified  Multinomial 

An alternative test for multinomial parameters uses the likelihood-ratio test. The kernel of
the multinomial likelihood is (1.15). Under $H_0$ the likelihood is maximized when $\hat{\pi}_j = \pi_{j0}$.
In the general case, it is maximized when $\hat{\pi}_j = n_j/n$. The ratio of the likelihoods equals

$$\Lambda = \frac{\prod_j (\pi_{j0})^{n_j}}{\prod_j (n_j/n)^{n_j}}.$$

Thus, the likelihood-ratio statistic, denoted by $G^2$, is

$$G^2 = -2 \log \Lambda = 2 \sum_j n_j \log(n_j / n\pi_{j0}). \tag{1.17}$$

This statistic, which has form (1.12), is called the likelihood-ratio chi-squared statistic.
The larger the value of $G^2$, the greater the evidence against $H_0$.

In the general case, the parameter space consists of $\{\pi_j\}$ subject to $\sum_j \pi_j = 1$, so the
dimensionality is $c - 1$. Under $H_0$, the $\{\pi_j\}$ are specified completely, so the dimension is
0. The difference in these dimensions equals $(c - 1)$. For large $n$, $G^2$ has a chi-squared null
distribution with df $= c - 1$.

When $H_0$ holds, the Pearson $X^2$ and the likelihood ratio $G^2$ both have large-sample
chi-squared distributions with df $= c - 1$. In fact, they are asymptotically equivalent in
that case; specifically, $X^2 - G^2$ converges in probability to zero. [This means that for any
$\epsilon > 0$, $P(|X^2 - G^2| < \epsilon) \to 1$ as $n \to \infty$; see Section 16.3.4.] When $H_0$ is false, $X^2$ and
$G^2$ grow in expectation proportionally to $n$; they need not take similar values, however,
even for very large $n$.

For fixed $c$, as $n$ increases the distribution of $X^2$ usually converges to chi-squared more
quickly than that of $G^2$. The chi-squared approximation is often poor for $G^2$ when $n/c < 5$.
When $c$ is large, it can be decent for $X^2$ for $n/c$ as small as 1 if the table does not contain
both very small and moderately large expected frequencies.

Alternatively, the multinomial probabilities induce exact distributions of these test statistics.
When it is not feasible to quickly enumerate all the possible samples, it is simple to
simulate the exact distributions by randomly generating a very large number of multinomial
samples of size $n$ with the null probabilities, and calculating $X^2$ and/or $G^2$ for each sample
(Hirji 2005, Chap. 13). The simulated P-value is the proportion of test statistic values that
are at least as large as the observed value.
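
A minimal sketch of this simulation approach follows. It is an illustration added here: the function names, the default number of simulations, the fixed seed, and the counts in the usage lines (30 hypothetical observations in 6 categories, tested against equal probabilities) are all assumptions, not data from the text. Multinomial samples are drawn with `random.choices` from the Python standard library.

```python
import math
import random


def pearson_x2(counts, probs):
    """Pearson statistic (1.16) for specified null probabilities."""
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, probs))


def lr_g2(counts, probs):
    """Likelihood-ratio statistic (1.17); empty cells contribute 0."""
    n = sum(counts)
    return 2 * sum(nj * math.log(nj / (n * pj))
                   for nj, pj in zip(counts, probs) if nj > 0)


def simulated_p_value(counts, probs, stat=pearson_x2, n_sim=10000, seed=1):
    """Proportion of simulated null samples whose statistic is at least
    as large as the observed value."""
    rng = random.Random(seed)
    n, c = sum(counts), len(probs)
    observed = stat(counts, probs)
    hits = 0
    for _ in range(n_sim):
        sample = [0] * c
        for j in rng.choices(range(c), weights=probs, k=n):
            sample[j] += 1                    # one multinomial sample of size n
        if stat(sample, probs) >= observed - 1e-12:
            hits += 1
    return hits / n_sim


# Hypothetical counts for illustration only
counts = [3, 7, 5, 10, 2, 3]
probs = [1 / 6] * 6
print(pearson_x2(counts, probs), simulated_p_value(counts, probs))
print(lr_g2(counts, probs), simulated_p_value(counts, probs, stat=lr_g2))
```

The Monte Carlo error of the simulated P-value shrinks at the usual rate, with standard error about $\sqrt{P(1 - P)/N}$ for $N$ simulated samples.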

1.5.4  Example:  Testing  Mendel’s  Theories 

Among its many applications, Pearson’s test was used in genetics to test Mendel’s theories
of natural inheritance. Mendel crossed pea plants of pure yellow strain with plants of pure
green strain. He predicted that second-generation hybrid seeds would be 75% yellow and
25% green, yellow being the dominant strain. One experiment produced $n = 8023$ seeds,
of which $n_1 = 6022$ were yellow and $n_2 = 2001$ were green. The expected frequencies
for $H_0$: $\pi_{10} = 0.75$, $\pi_{20} = 0.25$ are $\mu_1 = 8023(0.75) = 6017.25$ and $\mu_2 = 2005.75$. The
Pearson statistic $X^2 = 0.015$ and the likelihood-ratio statistic $G^2 = 0.015$ (df $= 1$) have
P-values of $P = 0.90$. They do not contradict Mendel’s hypothesis.
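
These values can be checked with a few lines of Python. The sketch below is an illustration added here; it uses the counts and null probabilities stated above, and it obtains the df = 1 chi-squared tail probability from `math.erfc`, using the fact that a chi-squared variable with df = 1 is the square of a standard normal variable. It also computes the normal score statistic (1.11) for the yellow category to show that its square agrees with $X^2$ when $c = 2$.

```python
import math

counts = [6022, 2001]          # yellow, green
probs = [0.75, 0.25]           # Mendel's predicted proportions
n = sum(counts)
expected = [n * p for p in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(counts, expected))


def chi2_df1_tail(x):
    """P(chi-squared with df = 1 exceeds x) = P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2))


# Normal score statistic (1.11) for the proportion of yellow seeds
z_s = (counts[0] / n - probs[0]) / math.sqrt(probs[0] * (1 - probs[0]) / n)

print(f"expected = {expected}")
print(f"X2 = {x2:.3f}, P = {chi2_df1_tail(x2):.2f}")
print(f"G2 = {g2:.3f}, P = {chi2_df1_tail(g2):.2f}")
print(f"z_S = {z_s:.3f}, z_S squared = {z_s ** 2:.3f}")
```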

When $c = 2$, Pearson’s $X^2$ simplifies to the square of the normal score statistic (1.11).
For Mendel’s data, $\hat{\pi}_1 = 6022/8023$, $\pi_{10} = 0.75$, $n = 8023$, and $z_S = 0.123$, for which
$X^2 = (0.123)^2 = 0.015$. In fact, for general $c$ the Pearson test is the score test about
specified values for multinomial parameters.

Mendel performed several experiments of this type. In 1936, R. A. Fisher summarized
Mendel’s results. He used the reproductive property of chi-squared: If $X_1^2, \ldots, X_k^2$ are
independent chi-squared statistics with degrees of freedom $\nu_1, \ldots, \nu_k$, then $\sum_{i=1}^k X_i^2$ has a
chi-squared distribution with df $= \sum_{i=1}^k \nu_i$. Fisher obtained a summary chi-squared statistic
equal to 42, with df $= 84$. A chi-squared distribution with df $= 84$ has mean 84 and standard
deviation $(2 \times 84)^{1/2} = 13.0$, and the right-tailed probability above 42 is $P = 0.99996$. In
other words, the chi-squared statistic was so small that the fit seemed too good.
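
The right-tailed probability quoted for Fisher's summary statistic can be checked without statistical software. The sketch below is an illustration added here, not Fisher's own calculation: it relies on the standard identity that, for even df $= 2m$, $P(\chi_{2m}^2 > x) = \sum_{k=0}^{m-1} e^{-x/2}(x/2)^k/k!$, a cumulative Poisson probability with mean $x/2$. The function name is our own.

```python
import math


def chi2_tail_even_df(x, df):
    """P(chi-squared_df > x) for even df, via the Poisson-sum identity
    P(chi2_{2m} > x) = sum_{k=0}^{m-1} exp(-x/2) (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this identity requires a positive even df")
    m = df // 2
    half = x / 2
    term = math.exp(-half)      # k = 0 term
    total = term
    for k in range(1, m):
        term *= half / k        # builds (x/2)^k / k! recursively
        total += term
    return total


df = 84
print("mean:", df, "standard deviation:", round(math.sqrt(2 * df), 1))
print("P(chi-squared_84 > 42):", chi2_tail_even_df(42, df))
```

Building each term from the previous one avoids computing large powers and factorials separately.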

Fisher  commented:  “The  general  level  of  agreement  between  Mendel’s  expectations 
and  his  reported  results  shows  that  it  is  closer  than  would  be  expected  in  the  best  of 
several  thousand  repetitions  ....  I  have  no  doubt  that  Mendel  was  deceived  by  a  gardening 
assistant,  who  knew  only  too  well  what  his  principal  expected  from  each  trial  made.”  In  a 
letter  written  at  the  time,  he  stated:  “Now,  when  data  have  been  faked,  I  know  very  well 
how  generally  people  underestimate  the  frequency  of  wide  chance  deviations,  so  that  the 
tendency  is  always  to  make  them  agree  too  well  with  expectations”  (Box  1978,  p.  297).  In 
summary,  goodness-of-fit  tests  can  reveal  not  only  when  a  fit  is  inadequate,  but  also  when  it 
is  better  than  random  fluctuations  would  have  us  expect.  [Fisher’s  daughter,  Joan  Fisher  Box 
(1978,  pp.  295-300),  discussed  Fisher’s  analysis  of  Mendel’s  data  and  the  accompanying 
controversy.  See  also  Pires  and  Branco  (2010).  Despite  possible  difficulties  with  Mendel’s 
data,  subsequent  work  led  to  general  acceptance  of  his  theories.] 

1.5.5  Testing  with  Estimated  Expected  Frequencies 

The chi-squared statistics (1.16) and (1.17) compare a sample distribution to a hypothetical
one $\{\pi_{j0}\}$. In some applications, $\{\pi_{j0} = \pi_{j0}(\boldsymbol{\theta})\}$ are functions of a smaller set of unknown
parameters $\boldsymbol{\theta}$. ML estimates $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ determine ML estimates $\{\pi_{j0}(\hat{\boldsymbol{\theta}})\}$ of $\{\pi_{j0}\}$ and hence
ML estimates $\{\hat{\mu}_j = n\pi_{j0}(\hat{\boldsymbol{\theta}})\}$ of expected frequencies.

Replacing $\{\mu_j\}$ by estimates $\{\hat{\mu}_j\}$ affects the distribution of $X^2$ and $G^2$. When $\dim(\boldsymbol{\theta}) = p$,
the true df $= (c - 1) - p$ (Section 16.3.3). Pearson (1917) realized this but did not
always take it into account (Section 17.2).

1.5.6  Example:  Pneumonia  Infections  in  Calves 

We now show a goodness-of-fit test with estimated expected frequencies. A sample of 156
dairy calves born in Okeechobee County, Florida, were classified according to whether they
caught pneumonia within 60 days of birth. Calves that got a pneumonia infection were also
classified according to whether they got a secondary infection within 2 weeks after the first
infection cleared up. Table 1.1 shows the data. Calves that did not get a primary infection
could not get a secondary infection, so no observations can fall in the category for “no”
primary infection and “yes” secondary infection. That combination is called a structural
zero.


Table 1.1  Primary and Secondary Pneumonia Infections in Calves

                         Secondary Infection(a)
Primary Infection        Yes            No
Yes                      30 (38.1)      63 (39.0)
No                        0 (-)         63 (78.9)

(a) Values in parentheses are estimated expected frequencies.
Source: Data courtesy of Thang Tran and G. A. Donovan,
College of Veterinary Medicine, University of Florida.


Table 1.2  Probability Structure for Hypothesis

                         Secondary Infection
Primary Infection        Yes            No                 Total
Yes                      $\pi^2$        $\pi(1 - \pi)$     $\pi$
No                       -              $1 - \pi$          $1 - \pi$

A goal of this study was to test whether the probability of primary infection was the same
as the conditional probability of secondary infection, given that the calf got the primary
infection. In other words, if $\pi_{ab}$ denotes the probability that a calf is classified in row $a$ and
column $b$ of this table, the null hypothesis is

$$H_0: \pi_{11} + \pi_{12} = \pi_{11}/(\pi_{11} + \pi_{12})$$

or $\pi_{11} = (\pi_{11} + \pi_{12})^2$. Let $\pi = \pi_{11} + \pi_{12}$ denote the probability of primary infection. The
null hypothesis states that the probabilities satisfy the structure that Table 1.2 shows;
that is, probabilities in a trinomial for the categories (yes-yes, yes-no, no-no) for
primary-secondary infection equal $[\pi^2, \pi(1 - \pi), 1 - \pi]$.

Let $n_{ab}$ denote the number of observations in row $a$ and column $b$ of Table 1.1. The ML
estimate of $\pi$ is the value maximizing the kernel of the multinomial likelihood

$$(\pi^2)^{n_{11}} (\pi - \pi^2)^{n_{12}} (1 - \pi)^{n_{22}}.$$

The log likelihood is

$$L(\pi) = n_{11} \log \pi^2 + n_{12} \log(\pi - \pi^2) + n_{22} \log(1 - \pi).$$

Differentiation with respect to $\pi$ gives the likelihood equation

$$\frac{2n_{11}}{\pi} + \frac{n_{12}}{\pi} - \frac{n_{12}}{1 - \pi} - \frac{n_{22}}{1 - \pi} = 0.$$

The solution is

$$\hat{\pi} = (2n_{11} + n_{12})/(2n_{11} + 2n_{12} + n_{22}).$$

For Table 1.1, $\hat{\pi} = 0.494$. Since $n = 156$, the estimated expected frequencies are
$\hat{\mu}_{11} = n\hat{\pi}^2 = 38.1$, $\hat{\mu}_{12} = n(\hat{\pi} - \hat{\pi}^2) = 39.0$, and $\hat{\mu}_{22} = n(1 - \hat{\pi}) = 78.9$. Table 1.1 shows them.
Pearson’s statistic is $X^2 = 19.7$. Since the $c = 3$ possible responses have $p = 1$ parameter
($\pi$) determining the expected frequencies, df $= (3 - 1) - 1 = 1$. There is strong evidence
against $H_0$ ($P = 0.00001$). Inspection of Table 1.1 reveals that many more calves got a
primary infection but not a secondary infection than $H_0$ predicts. The researchers
concluded that the primary infection had an immunizing effect that reduced the likelihood of
a secondary infection.
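
The calculations in this example can be retraced in a short Python sketch, added here as an illustration. It uses the counts of Table 1.1 and the closed-form ML estimate derived above; the variable names are our own, and the P-value uses the df = 1 chi-squared tail probability computed through `math.erfc`.

```python
import math

# Counts from Table 1.1 (the no/yes cell is a structural zero)
n11, n12, n22 = 30, 63, 63
n = n11 + n12 + n22

# ML estimate of pi under H0, from the solution of the likelihood equation
pi_hat = (2 * n11 + n12) / (2 * n11 + 2 * n12 + n22)

observed = [n11, n12, n22]
fitted = [n * pi_hat**2, n * (pi_hat - pi_hat**2), n * (1 - pi_hat)]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, fitted))
df = (3 - 1) - 1                               # c = 3 cells, p = 1 parameter
p_value = math.erfc(math.sqrt(x2 / 2))         # P(chi-squared_1 > x2)

print(f"pi_hat = {pi_hat:.3f}")
print("fitted =", [round(e, 1) for e in fitted])
print(f"X2 = {x2:.1f}, df = {df}, P = {p_value:.5f}")
```

The largest contribution to $X^2$ comes from the yes-no cell, where 63 calves were observed against about 39 expected.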
1.5.7 Chi-Squared Theoretical Justification

We now outline why Pearson’s statistic for a specified multinomial has a limiting
chi-squared distribution. Derivations for the likelihood-ratio statistic and cases with estimated
expected frequencies are given in Section 16.3.

For a multinomial sample $(n_1, \ldots, n_c)$ of size $n$, the marginal distribution of $n_j$ is the
bin$(n, \pi_j)$ distribution. For large $n$, by the normal approximation to the binomial, $n_j$ (and
$\hat{\pi}_j = n_j/n$) have approximate normal distributions. More generally, by the central limit
theorem, the sample proportions $\hat{\boldsymbol{\pi}} = (n_1/n, \ldots, n_{c-1}/n)^T$ have an approximate
multivariate normal distribution (Section 16.1.4). Let $\boldsymbol{\Sigma}_0$ denote the null covariance matrix of $\sqrt{n}\hat{\boldsymbol{\pi}}$,
and let $\boldsymbol{\pi}_0 = (\pi_{10}, \ldots, \pi_{c-1,0})^T$. Under $H_0$, since $\sqrt{n}(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0)$ converges to a $N(\mathbf{0}, \boldsymbol{\Sigma}_0)$
distribution, the quadratic form

$$n(\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0)^T \boldsymbol{\Sigma}_0^{-1} (\hat{\boldsymbol{\pi}} - \boldsymbol{\pi}_0) \tag{1.18}$$

has distribution converging to chi-squared with df $= c - 1$.

In Section 16.1.4 we show that the covariance matrix of $\sqrt{n}\hat{\boldsymbol{\pi}}$ has elements

$$\sigma_{jk} = \begin{cases} -\pi_j \pi_k & \text{if } j \ne k \\ \pi_j(1 - \pi_j) & \text{if } j = k. \end{cases}$$

The matrix $\boldsymbol{\Sigma}_0^{-1}$ has $(j, k)$th element $1/\pi_{c0}$ when $j \ne k$ and $(1/\pi_{j0} + 1/\pi_{c0})$ when $j = k$.
(You can verify this by showing that $\boldsymbol{\Sigma}_0 \boldsymbol{\Sigma}_0^{-1}$ equals the identity matrix.) With this
substitution, direct calculation with appropriate combining of terms yields that (1.18)
simplifies to $X^2$. In Section 16.3 we provide a formal proof in a more general setting.
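
Both claims in the preceding paragraph can be checked numerically. The following Python sketch is an illustration added here: the function names and the counts and null probabilities in the usage lines are hypothetical, chosen only for demonstration. It builds $\boldsymbol{\Sigma}_0$ and the stated inverse for the first $c - 1$ categories, confirms that their product is the identity matrix, and compares the quadratic form (1.18) with Pearson's $X^2$.

```python
def null_covariance(p0):
    """Covariance matrix of sqrt(n) * pi_hat for the first c - 1 categories."""
    m = len(p0) - 1
    return [[p0[j] * (1 - p0[j]) if j == k else -p0[j] * p0[k]
             for k in range(m)] for j in range(m)]


def null_covariance_inverse(p0):
    """Stated inverse: 1/pi_c0 off the diagonal, 1/pi_j0 + 1/pi_c0 on it."""
    m = len(p0) - 1
    return [[1 / p0[-1] + (1 / p0[j] if j == k else 0.0)
             for k in range(m)] for j in range(m)]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    m = len(a)
    return [[sum(a[i][l] * b[l][k] for l in range(m)) for k in range(m)]
            for i in range(m)]


def quadratic_form(counts, p0):
    """n (pi_hat - pi_0)^T Sigma_0^{-1} (pi_hat - pi_0), as in (1.18)."""
    n, m = sum(counts), len(p0) - 1
    d = [counts[j] / n - p0[j] for j in range(m)]
    inv = null_covariance_inverse(p0)
    return n * sum(d[j] * inv[j][k] * d[k] for j in range(m) for k in range(m))


def pearson_x2(counts, p0):
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, p0))


# Hypothetical counts and null probabilities, for illustration only
counts, p0 = [18, 55, 27], [0.25, 0.50, 0.25]
print(matmul(null_covariance(p0), null_covariance_inverse(p0)))
print(quadratic_form(counts, p0), pearson_x2(counts, p0))
```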

This argument is similar to Pearson’s in 1900. R. A. Fisher (1922) gave a simpler
justification, the gist of which follows: Suppose that $(n_1, \ldots, n_c)$ are independent Poisson
random variables with means $(\mu_1, \ldots, \mu_c)$. For large $\{\mu_j\}$, the standardized values
$\{z_j = (n_j - \mu_j)/\sqrt{\mu_j}\}$ have approximate standard normal distributions. Thus, $\sum_j z_j^2 = X^2$ has
an approximate chi-squared distribution with $c$ degrees of freedom. Adding the single linear
constraint $\sum_j (n_j - \mu_j) = 0$, thus converting the Poisson distributions to a multinomial,
we lose a degree of freedom.


1.6  BAYESIAN  INFERENCE  FOR  BINOMIAL  AND 
MULTINOMIAL  PARAMETERS 

This  book  mainly  uses  the  traditional,  so-called frequentist,  approach  to  statistical  inference. 
We  regard  parameter  values  as  fixed  and  apply  probability  statements  to  possible  values  for 
the  data,  given  the  parameter  values.  Recent  years  have  seen  increasing  popularity  of  the 
Bayesian  approach,  which  has  probability  distributions  for  parameters  as  well  as  for  data. 
This  yields  inferences  in  the  form  of  probability  statements  about  possible  values  for  the 
parameters,  given  the  data. 

1.6.1  The  Bayesian  Approach  to  Statistical  Inference 

The Bayesian approach assumes a prior distribution for the parameters. This probability
distribution may reflect subjective prior beliefs. Or, it may reflect information about the
parameter values from other studies. Or, it may be relatively uninformative, so that
inferential results are based almost entirely on the current data. The prior distribution combines
with the information that the data provide to generate a posterior distribution for the
parameters. Different choices for the prior distribution can result in quite different posterior
inferences, especially for small sample sizes, so the choice should be given careful thought.

By Bayes’ theorem, the posterior probability density function $h$ of a parameter $\theta$, given
the data $y$, relates to the probability mass function $f$ for $y$, given $\theta$, and the prior density
function $g$ for $\theta$, by

$$h(\theta \mid y) = \frac{f(y \mid \theta)\, g(\theta)}{f(y)}.$$

The denominator $f(y)$ on the right-hand side is the marginal probability mass function of
the data, that is, $\int_{\Theta} f(y \mid \theta) g(\theta)\, d\theta$. This is a constant with respect to $\theta$, so irrelevant for
inference about $\theta$. When we plug in the observed data, $f(y \mid \theta)$ is the likelihood function
when viewed as a function of $\theta$. So, the prior density function for $\theta$ multiplied by the
likelihood function determines the posterior density for $\theta$.

Except in specialized cases such as presented in Sections 1.6.2 and 1.6.3, there is not
a closed-form expression for the posterior distribution. The difficulty is in finding the
denominator integral that determines $f(y)$. The key part of the Bayes equation is the
numerator, because of the proportionality in terms of $\theta$,

$$h(\theta \mid y) \propto f(y \mid \theta)\, g(\theta).$$


Simulation  methods  are  used  to  approximate  the  posterior  distribution.  The  primary  method 
for  doing  this  is  Markov  chain  Monte  Carlo  (MCMC).  It  is  beyond  our  scope  to  discuss  the 
technical  details  of  how  an  MCMC  algorithm  works.  In  a  nutshell,  a  stochastic  process  of 
Markov  chain  form  is  designed  so  that  its  long-run  stationary  distribution  is  the  posterior 
distribution.  One  or  more  such  Markov  chains  provide  a  very  large  number  of  simulated 
values from the posterior distribution, and the distribution of the simulated values
approximates the posterior distribution. Enough observations are taken after a burn-in period so
that  the  Monte  Carlo  error  is  small  in  approximating  the  posterior  distribution  and  summary 
measures  of  interest  for  that  distribution,  such  as  the  mean  and  standard  deviation,  certain 
percentiles,  and  intervals  formed  using  those  percentiles. 

For an arbitrary parameter such as a coefficient in a regression-type model, Bayesian
methods of inference using the posterior distribution parallel those for frequentist inference.
For example, in lieu of P-values, posterior tail probabilities are useful. Information about
the direction of an effect is contained in the posterior probabilities $P(\beta > 0 \mid y)$ and
$P(\beta < 0 \mid y)$. With a flat prior distribution, $P(\beta < 0 \mid y)$ corresponds to the frequentist
P-value for the one-sided test with $H_a$: $\beta > 0$.

Analogous to the frequentist confidence interval is an interval that contains most of the
posterior distribution. Such an interval is referred to as a posterior interval or credible
interval. A common approach for constructing a posterior interval uses percentiles of
the posterior distribution, with equal probabilities in the two tails. For example, the 95%
equal-tail posterior interval for $\beta$ is the region between the 2.5 and 97.5 percentiles of
the posterior distribution for $\beta$. For unimodal posteriors, an alternative Bayesian highest
posterior density (HPD) interval has higher posterior density for every value inside the
interval than for every value outside it, subject to the posterior probability over the interval
equaling the desired confidence level. This method produces the shortest possible interval
with the given level.
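
For a single parameter on a bounded interval, the proportionality of the posterior to likelihood times prior can be used directly, without MCMC. The sketch below is an illustration added here and deliberately substitutes a simple grid approximation for the simulation methods described above: it evaluates likelihood times prior on a fine grid, normalizes the result, and reads off the posterior mean and a 95% equal-tail interval. The function names, the grid size, the uniform prior, and the use of the binomial data y = 0 with n = 25 from Section 1.4.3 are all assumptions made for this example.

```python
def grid_posterior(y, n, prior, m=20000):
    """Normalize (binomial likelihood) x (prior density) on a midpoint grid
    over (0, 1). Returns the grid points and their posterior probabilities."""
    grid = [(i + 0.5) / m for i in range(m)]
    weights = [t**y * (1 - t)**(n - y) * prior(t) for t in grid]
    total = sum(weights)                       # plays the role of f(y)
    return grid, [w / total for w in weights]


def posterior_mean(grid, probs):
    return sum(t * p for t, p in zip(grid, probs))


def posterior_quantile(grid, probs, q):
    """Smallest grid point whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for t, p in zip(grid, probs):
        cumulative += p
        if cumulative >= q:
            return t
    return grid[-1]


# Assumed setting: uniform prior, y = 0 successes in n = 25 trials
grid, probs = grid_posterior(0, 25, prior=lambda t: 1.0)
lower = posterior_quantile(grid, probs, 0.025)
upper = posterior_quantile(grid, probs, 0.975)
print(f"posterior mean = {posterior_mean(grid, probs):.4f}")
print(f"95% equal-tail posterior interval = ({lower:.4f}, {upper:.4f})")
```

A grid of this kind becomes impractical as the number of parameters grows, which is one reason simulation methods such as MCMC are the primary tool in practice.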

We  next  summarize  the  Bayesian  approach  for  binomial  and  multinomial  parameters. 
Then,  in  the  rest  of  the  book,  we’ll  occasionally  present  Bayesian  alternatives  to  frequentist 
model-based  inference. 

1.6.2  Binomial  Estimation:  Beta  and  Logit-Normal  Prior  Distributions 

The simplest Bayesian inference for a binomial parameter $\pi$ uses a member of the beta
distribution as the prior distribution. The beta($\alpha_1$, $\alpha_2$) probability density function for $\pi$ is
proportional to

$$\pi^{\alpha_1 - 1}(1 - \pi)^{\alpha_2 - 1}.$$

The parameters $\alpha_1 > 0$ and $\alpha_2 > 0$ of the prior are often referred to as hyperparameters,
to distinguish them from the parameter that is the object of inference (in this case, $\pi$). The
beta distribution has

$$E(\pi) = \frac{\alpha_1}{\alpha_1 + \alpha_2} \quad\text{and}\quad \mathrm{var}(\pi) = \frac{\alpha_1 \alpha_2}{(\alpha_1 + \alpha_2)^2(\alpha_1 + \alpha_2 + 1)}.$$

The family of beta probability density functions has a wide variety of shapes over the
interval (0, 1), including uniform when $\alpha_1 = \alpha_2 = 1$, unimodal symmetric ($\alpha_1 = \alpha_2 > 1$),
unimodal skewed left ($\alpha_1 > \alpha_2 > 1$), unimodal skewed right ($\alpha_2 > \alpha_1 > 1$), and bimodal
U-shaped ($\alpha_1 < 1$, $\alpha_2 < 1$).

Often prior knowledge about $\pi$ can be expressed in terms of a mean and standard deviation
for a prior for $\pi$. Then, the one-to-one correspondence between those moments and
($\alpha_1$, $\alpha_2$) based on the above moment expressions determines a beta prior. By contrast, lack
of prior knowledge about $\pi$ might suggest using a uniform prior distribution. The posterior
distribution then has the same shape as the binomial likelihood function. Alternatively, a
popular prior distribution with Bayesians is the Jeffreys prior, which is proportional to
the square root of the determinant of the Fisher information matrix for the parameters of
interest. With this approach, prior distributions for different scales of measurement for the
parameters (e.g., for $\pi$ or for $\phi = \log[\pi/(1 - \pi)]$) are equivalent. For a binomial parameter,
the Jeffreys prior is the beta distribution with $\alpha_1 = \alpha_2 = 0.5$.
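
The moment-matching step described above can be sketched in Python. This is only an illustration under our own naming (`beta_from_moments` and `beta_moments` are hypothetical helpers, not functions from the text or from any library); it inverts the mean and variance expressions given earlier for the beta distribution.

```python
def beta_moments(alpha1, alpha2):
    """Mean and variance of a beta(alpha1, alpha2) distribution."""
    total = alpha1 + alpha2
    mean = alpha1 / total
    var = alpha1 * alpha2 / (total ** 2 * (total + 1.0))
    return mean, var


def beta_from_moments(mean, sd):
    """Hyperparameters (alpha1, alpha2) of the beta prior with this mean and sd.

    Uses var = mean * (1 - mean) / (alpha1 + alpha2 + 1), which follows from
    the moment expressions for the beta distribution.
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    var = sd ** 2
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    total = mean * (1.0 - mean) / var - 1.0  # alpha1 + alpha2
    return mean * total, (1.0 - mean) * total


# Prior mean 0.08 and standard deviation 0.04, the values used in Section 1.6.4.
alpha1, alpha2 = beta_from_moments(0.08, 0.04)
```

For the prior mean 0.08 and standard deviation 0.04 used in Section 1.6.4, this gives hyperparameters of about 3.6 and 41.4.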

The  beta  distribution  is  the  conjugate  prior  distribution  for  inference  about  a  binomial 
parameter.  This  means  that  it  is  the  family  of  probability  distributions  such  that,  when 
combined  with  the  likelihood  function,  the  posterior  distribution  falls  in  the  same  family. 
When we combine a beta($\alpha_1$, $\alpha_2$) prior distribution with a binomial likelihood function, the
posterior distribution is a beta($y + \alpha_1$, $n - y + \alpha_2$) distribution, for which the mean is

$$\frac{y + \alpha_1}{n + \alpha_1 + \alpha_2} = \left(\frac{n}{n + \alpha_1 + \alpha_2}\right)\frac{y}{n} + \left(\frac{\alpha_1 + \alpha_2}{n + \alpha_1 + \alpha_2}\right)\frac{\alpha_1}{\alpha_1 + \alpha_2}.$$

This is a weighted average of the sample proportion $\hat{\pi} = y/n$ and the prior mean, with
more weight given to the sample proportion as $n$ increases. Conjugate priors were the primary
method of conducting Bayesian analysis before the development of computationally
intensive  methods,  such  as  Markov  chain  Monte  Carlo,  for  evaluating  the  integral  that 
determines  the  posterior  distribution. 
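
The conjugate update and the weighted-average form of the posterior mean can be checked with a few lines of Python. This is a minimal sketch; the function names are ours and purely illustrative.

```python
def beta_binomial_posterior(y, n, alpha1, alpha2):
    """Posterior beta hyperparameters after y successes in n binomial trials."""
    return y + alpha1, n - y + alpha2


def posterior_mean(y, n, alpha1, alpha2):
    """Mean of the beta(y + alpha1, n - y + alpha2) posterior."""
    a, b = beta_binomial_posterior(y, n, alpha1, alpha2)
    return a / (a + b)


def posterior_mean_weighted(y, n, alpha1, alpha2):
    """Same mean, written as a weighted average of y/n and the prior mean."""
    k = alpha1 + alpha2
    weight = n / (n + k)  # weight on the sample proportion; grows with n
    return weight * (y / n) + (1.0 - weight) * (alpha1 / k)


# Example: uniform prior, y = 0 successes in n = 25 trials.
direct = posterior_mean(0, 25, 1.0, 1.0)
weighted = posterior_mean_weighted(0, 25, 1.0, 1.0)
```

Both expressions return 1/27 for a uniform prior with $y = 0$ and $n = 25$, the case revisited in Section 1.6.4.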

An alternative prior distribution assumes a normal distribution for the logit parameter,
$\log[\pi/(1 - \pi)]$. This parameter, which is relevant for many analyses presented in this book,
takes values over the entire real line. With a $N(0, \sigma^2)$ prior distribution for $\log[\pi/(1 - \pi)]$,
on the $\pi$ scale the shape of this logit-normal (also called logistic-normal) density is
symmetric,³ being unimodal when $\sigma^2 \le 2$ and bimodal when $\sigma^2 > 2$, but always tapering off
toward 0 as $\pi$ approaches 0 or 1. Specifically, it is mound-shaped for small $\sigma$, roughly
uniform except near the boundaries when $\sigma \approx 1.5$, and with more pronounced peaks for
the modes when $\sigma$ is about 2 or larger. The peaks for the modes get closer to 0 and 1 as
$\sigma$ increases further, and the curve has an appearance that is essentially U-shaped when $\sigma = 3$
and similar to that of a beta(0.5, 0.5) prior. For $\sigma = (1, 2, 3)$, the standard deviations on the
$\pi$ scale of these priors are (0.21, 0.31, 0.37), similar to the values (0.22, 0.29, 0.35) for the
beta priors with $\alpha_1 = \alpha_2 = (2.0, 1.0, 0.5)$. The logit-normal prior with $\sigma = 2.67$ matches
the Jeffreys prior in the first two moments (on the probability scale), and the logit-normal
prior with $\sigma = 1.69$ matches the uniform prior in the first two moments. With a $N(\mu, \sigma^2)$
prior distribution for the logit, the density for $\pi$ is skewed left when $\mu > 0$ and skewed
right when $\mu < 0$.
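
The logit-normal distribution has no closed-form moments on the $\pi$ scale, but the standard deviations quoted above are easy to approximate numerically. The sketch below uses composite Simpson integration over the standard normal density; the function name, integration range, and step count are our own choices, not taken from the text.

```python
import math


def logit_normal_moments(mu, sigma, half_width=12.0, steps=4000):
    """Approximate mean and sd of pi = expit(mu + sigma * Z), with Z ~ N(0, 1).

    Integrates over z in [-half_width, half_width] using composite Simpson's
    rule; `steps` must be even.
    """
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    h = 2.0 * half_width / steps
    m1 = 0.0  # accumulates E(pi)
    m2 = 0.0  # accumulates E(pi^2)
    for i in range(steps + 1):
        z = -half_width + i * h
        if i == 0 or i == steps:
            w = 1.0
        elif i % 2:
            w = 4.0
        else:
            w = 2.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        p = 1.0 / (1.0 + math.exp(-(mu + sigma * z)))  # inverse logit
        m1 += w * density * p
        m2 += w * density * p * p
    m1 *= h / 3.0
    m2 *= h / 3.0
    return m1, math.sqrt(m2 - m1 * m1)


# Standard deviations on the pi scale for sigma = 1, 2, 3 with mu = 0.
sds = [logit_normal_moments(0.0, s)[1] for s in (1.0, 2.0, 3.0)]
```

With $\mu = 0$ the mean is 0.5 by symmetry, and the computed standard deviations can be compared with the values (0.21, 0.31, 0.37) given in the text.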

Yet another possibility, hierarchical in nature, uses beta or logit-normal priors but assumes
a distribution for their hyperparameters instead of assigning fixed values. That
second-stage  distribution  may  have  its  own  hyperparameters.  See  Section  3.6.7,  Albert 
(2010),  Good  (1965),  and  Leonard  (1972). 


1.6.3  Multinomial  Estimation:  Dirichlet  Prior  Distributions 

For $c > 2$ categories, the beta distribution generalizes to the Dirichlet distribution. It is
defined over the simplex of nonnegative values $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_c)$ that sum to 1. Expressed
in terms of gamma functions and $c$ hyperparameters $\{\alpha_i > 0\}$, the Dirichlet probability
density function is

$$g(\boldsymbol{\pi}) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^{c} \pi_i^{\alpha_i - 1} \quad \text{for } 0 < \pi_i < 1 \text{ all } i, \ \sum_i \pi_i = 1.$$

The case $\{\alpha_i = 1\}$ is the uniform density over the possible probability values. The case
$\{\alpha_i = 1/2\}$ is the Jeffreys prior for multinomial parameters. Let $K = \sum_i \alpha_i$. The Dirichlet
distribution has $E(\pi_i) = \alpha_i/K$ and $\mathrm{var}(\pi_i) = \alpha_i(K - \alpha_i)/[K^2(K + 1)]$. For particular relative
sizes of $\{\alpha_i\}$, such as identical values, the distribution is more tightly concentrated
around the means as $K$ increases.

Let $\mathbf{n} = (n_1, \ldots, n_c)$ denote cell counts from $n = \sum_i n_i$ independent observations with
cell probabilities $\boldsymbol{\pi}$. Formula (1.2) showed the multinomial probability mass function for
$\mathbf{n}$. Multiplying this by the Dirichlet prior density function $g(\boldsymbol{\pi})$ contributes to a posterior
density function $h(\boldsymbol{\pi} \mid \mathbf{n})$ for $\boldsymbol{\pi}$ that is also Dirichlet, but with the hyperparameters $\{\alpha_i\}$
replaced by $\{\alpha_i' = n_i + \alpha_i\}$. The mean of the posterior distribution of $\pi_i$ is

$$E(\pi_i \mid n_1, \ldots, n_c) = (n_i + \alpha_i)/(n + K).$$

³ See logitnorm.r-forge.r-project.org and the “Logit-normal distribution” entry in
wikipedia.org for figures illustrating the shapes described below.

Let $\gamma_i = E(\pi_i) = \alpha_i/K$. This Bayesian estimator equals the weighted average

$$\left(\frac{n}{n + K}\right) p_i + \left(\frac{K}{n + K}\right) \gamma_i$$

of the sample proportion $p_i = n_i/n$ and the mean $\gamma_i$ of the prior distribution for $\pi_i$. This
posterior mean takes the form of a sample proportion when the prior information corresponds
to $K$ additional observations of which $\alpha_i$ were outcomes of type $i$. (We'll consider
a formal way of setting such data augmentation priors in Section 7.2.4.) With identical
$\{\alpha_i\}$, the Bayes estimate shrinks each sample proportion toward the equi-probability value
$\gamma_i = 1/c$. Greater shrinkage occurs as $K$ increases, for fixed $n$.
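
The shrinkage just described can be illustrated with a short Python sketch. The counts below are hypothetical (they do not come from the text), and the function names are ours.

```python
def dirichlet_posterior_mean(counts, alphas):
    """Posterior means (n_i + alpha_i) / (n + K) under a Dirichlet prior."""
    n = sum(counts)
    k = sum(alphas)
    return [(n_i + a_i) / (n + k) for n_i, a_i in zip(counts, alphas)]


def shrinkage_form(counts, alphas):
    """Same estimates as a weighted average of p_i = n_i / n and gamma_i = alpha_i / K."""
    n = sum(counts)
    k = sum(alphas)
    weight = n / (n + k)  # weight on the sample proportion
    return [weight * (n_i / n) + (1.0 - weight) * (a_i / k)
            for n_i, a_i in zip(counts, alphas)]


# Hypothetical counts for c = 3 categories, with a uniform Dirichlet prior.
counts = [3, 0, 7]
uniform_prior = [1.0, 1.0, 1.0]    # K = 3
stronger_prior = [5.0, 5.0, 5.0]   # K = 15, same prior means of 1/3

weak = dirichlet_posterior_mean(counts, uniform_prior)
strong = dirichlet_posterior_mean(counts, stronger_prior)
```

With identical hyperparameters, raising $K$ from 3 to 15 pulls every estimate closer to $1/3$; in particular, the empty second category receives a strictly positive estimate in both cases.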

Bayesian  estimators  of  multinomial  parameters,  unlike  the  sample  proportions,  are 
slightly  biased  for  finite  n.  Usually,  though,  they  have  smaller  total  mean  squared  error 
(MSE)  than  the  sample  proportions.  They  are  not  uniformly  better  for  all  possible  parameter 
values, however. For instance, if a particular $\pi_i = 0$, then $p_i = 0$ with probability one, so
the sample proportion is then better than any other estimator. We do not expect $\pi_i = 0$ in
practice, and the parameter space is often defined under the restriction that all $\pi_i > 0$, but
this limiting behavior explains why the ML estimator can have smaller MSE than the Bayes
estimator when $\pi_i$ is very near 0.
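
To make the MSE comparison concrete, the following Python sketch evaluates exact mean squared errors in the binomial case ($c = 2$), which is our simplifying assumption here; the text's statement concerns total MSE for general multinomials. The function names are illustrative. The Bayes estimator $(y + \alpha_1)/(n + K)$, with $K = \alpha_1 + \alpha_2$, has variance $n\pi(1 - \pi)/(n + K)^2$ and bias $(\alpha_1 - K\pi)/(n + K)$.

```python
def mse_ml(pi, n):
    """Exact MSE of the sample proportion y/n (unbiased, so MSE = variance)."""
    return pi * (1.0 - pi) / n


def mse_bayes(pi, n, alpha1, alpha2):
    """Exact MSE of the posterior mean (y + alpha1) / (n + alpha1 + alpha2)."""
    k = alpha1 + alpha2
    variance = n * pi * (1.0 - pi) / (n + k) ** 2
    bias = (alpha1 - k * pi) / (n + k)
    return variance + bias ** 2


n = 25
for pi in (0.001, 0.1, 0.5):
    print(pi, mse_ml(pi, n), mse_bayes(pi, n, 1.0, 1.0))
```

Under a uniform prior with $n = 25$, the Bayes estimator has the smaller MSE at $\pi = 0.5$, whereas the ML estimator has the smaller MSE at $\pi = 0.001$, in line with the limiting behavior described above.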

1.6.4  Example:  Estimating  Vegetarianism  Revisited 

In  Section  1.4.3  we  estimated  the  population  proportion  of  vegetarians  with  a  sample 
of size $n = 25$ for which $y = 0$. The ML estimate of $\pi$ is $\hat{\pi} = 0.0$, and the 95% score
confidence  interval  is  (0.0,  0.133).  How  does  this  compare  to  Bayesian  point  and  interval 
estimates? 

First, we use a uniform prior distribution for $\pi$, reflecting prior ignorance. For this beta(1,
1) prior with $y = 0$ and $n = 25$, the posterior distribution is beta(1, 26). The posterior mean
is 1/27 = 0.037. The posterior 95% equal-tail interval is (0.001, 0.132), the endpoints being
the 2.5 and 97.5 percentiles of the beta posterior density. This interval is similar to the
frequentist 95% score interval, but the prior information has the impact of moving the left
boundary slightly away from 0.0. By contrast, since the posterior density is proportional
to $(1 - \pi)^{25}$ and hence monotone decreasing, the 95% highest posterior density (HPD)
interval has lower limit of 0 and upper limit that is the 95th percentile of the beta(1, 26)
density, which is 0.109.
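
The numbers for the uniform-prior analysis can be reproduced without special software, because a beta(1, $b$) distribution has cdf $1 - (1 - x)^b$ and therefore closed-form percentiles. The Python sketch below relies on that fact; the helper name `beta1b_quantile` is ours, and the shortcut applies only when the first beta parameter equals 1.

```python
def beta1b_quantile(p, b):
    """p-th quantile of a beta(1, b) distribution: solves 1 - (1 - x)**b = p."""
    return 1.0 - (1.0 - p) ** (1.0 / b)


b = 26                                   # posterior is beta(1, 26)
posterior_mean = 1.0 / (1.0 + b)         # mean of beta(1, b)

# Equal-tail 95% interval: 2.5 and 97.5 percentiles.
equal_tail = (beta1b_quantile(0.025, b), beta1b_quantile(0.975, b))

# The density is monotone decreasing, so the 95% HPD interval starts at 0
# and ends at the 95th percentile.
hpd = (0.0, beta1b_quantile(0.95, b))

print(round(posterior_mean, 3), [round(x, 3) for x in equal_tail], round(hpd[1], 3))
```

For posteriors without a closed-form cdf, such as the beta(3.6, 66.4) posterior considered next, a numerical routine for beta percentiles would be needed instead.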

For  contrast,  let’s  use  a  much  more  informative  beta  prior.  Suppose  we  used  a  subjective 
approach and were quite sure a priori that $\pi$ falls between about 0 and 0.16. We might
summarize this by a prior mean of 0.08 and standard deviation of 0.04. These moments
correspond to beta hyperparameters of $\alpha_1 = 3.6$ and $\alpha_2 = 41.4$, for which 0.16 is the 96th
percentile. Then, the posterior is the beta(3.6, 66.4), which has mean = 0.051 and 95%
posterior equal-tail interval of (0.013, 0.114) and HPD interval of (0.008, 0.103). Stronger
prior  beliefs  result  in  greater  shrinkage  of  the  Bayes  estimate  toward  the  prior  mean  and  a 
narrower  posterior  equal-tail  interval. 

1.6.5  Binomial  and  Multinomial  Estimation:  Improper  Priors 

For multinomial data, the sample proportion $p_i$ is the ML estimate of $\pi_i$. It results as the
special case of the Bayesian estimate (1.19) when each $\alpha_i = 0$. But when any $\alpha_i = 0$, the
Dirichlet formula is not a legitimate probability density function, as it integrates to $\infty$
instead of 1. It is then an example of an improper prior distribution. Bayesian inference
sometimes uses such improper prior distributions, as long as the posterior distribution is
proper (e.g., Lindley 1964). The Dirichlet posterior is proper as long as $n_i > 0$ for each $i$
having $\alpha_i = 0$.

For parameters that can take values over the entire real line, a common improper distribution
is uniform over all real numbers. For a binomial parameter $\pi$, the improper beta(0, 0)
prior for $\pi$ corresponds to an improper uniform distribution for logit($\pi$). Haldane (1948)
suggested that this prior is often sensible in genetics applications, such as for mutation rates
for which $\log(\pi)$ might be approximately uniform for $\pi$ close to 0.
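
The correspondence between the beta(0, 0) prior and a uniform prior on the logit scale follows from a change of variables. The short derivation below is our own filling-in of that step, writing $\phi = \log[\pi/(1 - \pi)]$ for the logit as in Section 1.6.2 and using $g$ and $f$ (our notation) for the improper densities of $\pi$ and $\phi$.

```latex
% beta(0,0) "density" on the probability scale:
g(\pi) \propto \pi^{-1}(1 - \pi)^{-1}, \qquad 0 < \pi < 1.

% Logit transformation and its Jacobian:
\phi = \log\frac{\pi}{1 - \pi}, \qquad
\pi = \frac{e^{\phi}}{1 + e^{\phi}}, \qquad
\frac{d\pi}{d\phi} = \pi(1 - \pi).

% Implied density on the logit scale:
f(\phi) = g\bigl(\pi(\phi)\bigr)\left|\frac{d\pi}{d\phi}\right|
        \propto \pi^{-1}(1 - \pi)^{-1}\,\pi(1 - \pi) = 1,
        \qquad -\infty < \phi < \infty.
```

The implied density for $\phi$ is constant over the real line, which is the improper uniform distribution mentioned above.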


NOTES 

Section  1.1:  Categorical  Response  Data 

1.1 Measurement scales: Stevens (1951) defined (nominal, ordinal, interval) scales of measurement.
Other scales result from mixtures of these types. For instance, partially ordered scales
occur  when  subjects  respond  to  questions  having  categories  that  are  ordered  except  for  don’t 
know  or  undecided  categories. 


Section  1.3:  Statistical  Inference  for  Categorical  Data 

1.2 Chi-squared: Greenwood and Nikulin (1996), Kendall and Stuart (1979), and Lancaster (1969)
presented in-depth overviews of the chi-squared distribution. Cochran (1952) presented a
historical survey of chi-squared tests of fit. See also Cressie and Read (1989), Koch and Bhapkar
(1982), Koehler (2005), Moore (1986b), Read and Cressie (1988), and Watson (1959).

1.3 Wald/LR/score: Disadvantages of the Wald method compared with the score and likelihood-ratio
methods are that it does not apply when $\hat{\beta}$ is on the boundary of the parameter space (such
as a sample proportion $\hat{\pi} = 0$) and its results depend on the parameterization; inference based
on $\hat{\beta}$ and its SE is not equivalent to inference based on a nonlinear function of it, such as $\log(\hat{\beta})$
and its SE. See Section 5.2.6. “Higher-order asymptotics” improve on simple normal and chi-squared
approximations for distributions of these statistics (Brazzale et al. 2007, Davison et al.
2006).


Section  1.4:  Statistical  Inference  for  Binomial  Parameters 

1.4 Score CI: The superiority of the score interval to the Wald interval for $\pi$ was shown by, among
others,  Agresti  and  Coull  (1998),  Blyth  and  Still  (1983),  Brown  et  al.  (2001),  Ghosh  (1979), 
Newcombe  (1998a),  and  Schader  and  Schmid  (1990). 

1.5 Continuity correction: Using continuity corrections with large-sample methods provides approximations
to exact small-sample methods. We do not present them, since if you prefer an
exact method, with modern computational power you can usually implement it directly rather
than  approximate  it.  However,  we'll  see  in  Sections  3.5.5,  3.5.7,  7.3.7,  16.6.1,  and  16.6.4  that 
exact  methods  have  the  disadvantage  that  they  behave  conservatively. 

1.6 Discreteness: Suppose a statistic $T$ has discrete distribution with cdf $F(t)$. Then, $F(T)$ is
stochastically larger than uniform over [0, 1], its cdf being everywhere no greater than that
of the uniform (Casella and Berger 2001, pp. 77, 434). Likewise, a P-value based on $T$ has
null distribution stochastically larger than uniform. In theory, we can eliminate issues with
discreteness in tests by performing a supplementary randomization on the boundary of a
critical region (see Exercise 1.12). In rejecting $H_0$ at the boundary with a certain probability,
we can obtain type I error probability = $\alpha$ even when $\alpha$ is not an achievable P-value. For such
randomization, the P-value is

$$\text{randomized P-value} = U \times P(T = t_0) + P(T > t_0),$$

where $U$ denotes a uniform(0, 1) random variable (Stevens 1950). In practice, this is not done,
as it is absurd to let a random number determine a decision. The mid P-value replaces the
arbitrary uniform multiple $U \times P(T = t_0)$ by its expected value $0.50 \times P(T = t_0)$.
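
The three kinds of P-value in this note can be compared with a small Python sketch. The null distribution used below is hypothetical, and the function names are ours; larger values of $T$ are taken to give more evidence against $H_0$.

```python
def tail_probabilities(pmf, t_obs):
    """Return (P(T = t_obs), P(T > t_obs)) for a null pmf given as {value: prob}."""
    at = pmf.get(t_obs, 0.0)
    above = sum(p for t, p in pmf.items() if t > t_obs)
    return at, above


def ordinary_p_value(pmf, t_obs):
    at, above = tail_probabilities(pmf, t_obs)
    return at + above


def mid_p_value(pmf, t_obs):
    at, above = tail_probabilities(pmf, t_obs)
    return 0.5 * at + above


def randomized_p_value(pmf, t_obs, u):
    """u is a draw from uniform(0, 1); u = 0.5 reproduces the mid P-value."""
    at, above = tail_probabilities(pmf, t_obs)
    return u * at + above


def expected_mid_p_value(pmf):
    """Null expectation of the mid P-value, averaging over the pmf of T."""
    return sum(p * mid_p_value(pmf, t) for t, p in pmf.items())


# Hypothetical null distribution on three support points.
null_pmf = {0: 0.5, 1: 0.3, 2: 0.2}
print(ordinary_p_value(null_pmf, 2), mid_p_value(null_pmf, 2))
```

For any such discrete null distribution, `expected_mid_p_value` returns 0.50, the property that Exercise 1.27 asks the reader to prove.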


Section  1.5:  Statistical  Inference  for  Multinomial  Parameters 

1.7 Multinomials: Other references on testing a specified multinomial include Good et al. (1970)
and Baglivo et al. (1992). For simultaneous confidence intervals for multinomial parameters
and their differences, see Exercise 1.36, Chafaï (2009), Fitzpatrick and Scott (1987), Goodman
(1965), and Sison and Glaz (1995).


Section  1.6:  Bayesian  Inference  for  Binomial  and  Multinomial  Parameters 

1.8 Beta/Dirichlet priors: Agresti and Hitchcock (2005) surveyed Bayesian methods for categorical
data. Lindley (1964) and Good (1965) were influential early articles about Bayesian
estimation  of  multinomial  parameters  using  a  Dirichlet  prior.  Brown  et  al.  (2001)  showed  that 
the  Jeffreys  beta  prior  yields  posterior  intervals  for  the  binomial  parameter  that  perform  well, 
having  actual  coverage  probability  close  to  the  nominal  level.  Good  (1967)  gave  a  Bayesian 
goodness-of-fit  test  that  multinomial  probabilities  are  identical,  using  a  hierarchical  approach 
with  a  symmetric  Dirichlet  prior  that  has  a  log  Cauchy  distribution  for  its  hyperparameter. 

1.9 Loss functions: In decision-theoretic terms, the Bayes estimator minimizes the posterior
expected value of a loss function that measures the distance between an estimator $T(y)$ and
a parameter $\theta$. It is the posterior mean for squared error loss and posterior median for
absolute error loss. For loss function $w(\theta)(T - \theta)^2$, it is $E[\theta w(\theta) \mid y]/E[w(\theta) \mid y]$. With loss function
$(T - \pi)^2/[\pi(1 - \pi)]$ and uniform prior, the Bayes estimator of $\pi$ is the ML estimator $p = y/n$.
Its risk function (the expected loss, treated as a function of $\pi$) is constant. Bayes estimators
with constant risk are minimax, the maximum risk being no greater than the maximum risk for
any other estimator. Johnson (1971) showed that $p$ is an admissible estimator, for standard loss
functions. For other cases, see DasGupta and Zhang (2004). Blyth (1980) noted that for large
$n$, $E|\hat{\pi} - \pi| \approx \sqrt{2\pi(1 - \pi)/(\pi_c n)}$, where $\pi_c = 3.14\ldots$ is the mathematical constant.


EXERCISES 

Applications 

1.1  Identify  each  variable  as  nominal,  ordinal,  or  interval. 

a.  UK  political  party  preference  (Labour,  Liberal  Democrat,  Conservative) 

b.  Anxiety  rating  (none,  mild,  moderate,  severe,  very  severe) 

c.  Patient  survival  (in  number  of  months) 

d. Clinic location (London, Boston, Madison, Rochester, Montreal)

e.  Response  of  tumor  to  chemotherapy  (complete  elimination,  partial  reduction, 
stable,  growth  progression) 

f.  Favorite  grocery  store  for  UK  residents  (Sainsbury,  Tesco,  Waitrose,  other) 

1.2 Each of 100 multiple-choice questions on an exam has four possible answers, one
of  which  is  correct.  For  each  question,  a  student  guesses  by  selecting  an  answer 
randomly. 

a.  Specify  the  distribution  of  the  number  of  correct  answers. 

b.  Find  the  mean  and  standard  deviation  of  that  distribution.  Would  it  be  surprising 
if  the  student  made  at  least  50  correct  responses?  Why? 

c. Specify the distribution of $(n_1, n_2, n_3, n_4)$, where $n_j$ is the number of times the
student picked choice $j$.

d. Find $E(n_j)$ and $\mathrm{var}(n_j)$. Show that $\mathrm{cov}(n_j, n_k) = -6.25$ and $\mathrm{corr}(n_j, n_k) = -0.333$.


1.3  An  experiment  studies  the  number  of  insects  that  survive  a  certain  dose  of  an 
insecticide,  using  several  batches  of  insects  of  size  n  each.  The  insects  are  sensitive 
to  factors  that  vary  among  batches  during  the  experiment  but  were  not  measured, 
such  as  temperature  level.  Explain  why  the  distribution  of  the  number  of  insects  per 
batch surviving the experiment might show overdispersion relative to a bin($n$, $\pi$)
distribution. 

1.4  In  his  autobiography  A  Sort  of  Life,  British  author  Graham  Greene  described  a  period 
of  severe  mental  depression  during  which  he  played  Russian  roulette.  This  “game” 
consists  of  putting  a  bullet  in  one  of  the  six  chambers  of  a  pistol,  spinning  the 
chambers  to  select  one  at  random,  and  then  firing  the  pistol  once  at  one’s  head. 

a.  Greene  played  this  game  six  times  and  was  lucky  that  none  of  them  resulted  in  a 
bullet  firing.  Find  the  probability  of  this  outcome. 

b. Suppose that he had kept playing this game until the bullet fired. Let $Y$ denote the
number of the game on which it fires. Explain why the probability mass function
for $Y$ is the geometric, $p(y) = (5/6)^{y-1}(1/6)$, $y = 1, 2, 3, \ldots$.

1.5  When  the  2010  General  Social  Survey  asked,  “Please  tell  me  whether  or  not  you 
think  it  should  be  possible  for  a  pregnant  woman  to  obtain  a  legal  abortion  if 
she  is  married  and  does  not  want  any  more  children,”  587  replied  “yes”  and  636 
replied “no.” Let $\pi$ denote the population proportion who would reply “yes.” Find the
P-value for testing $H_0\colon \pi = 0.50$ using the score test, and construct a 95% confidence
interval for $\pi$. Interpret the results.

1.6 Refer to the vegetarianism example in Section 1.4.3. For testing $H_0\colon \pi = 0.50$
against $H_a\colon \pi \ne 0.50$, show that:

a. The likelihood-ratio statistic equals $2[25 \log(25/12.5)] = 34.7$.

b.  The  chi-squared  form  of  the  score  statistic  equals  25.0. 

c.  The  Wald  z  or  chi-squared  statistic  is  infinite. 

1.7 In a crossover trial comparing a new drug to a standard, $\pi$ denotes the probability
that the new one is judged better. It is desired to estimate $\pi$ and test $H_0\colon \pi = 0.50$
against $H_a\colon \pi \ne 0.50$. In 20 independent observations, the new drug is better each
time. 

a. Find and sketch the likelihood function. Is it close to the quadratic shape that
large-sample normal approximations utilize?

b. Give the ML estimate of $\pi$. Conduct a Wald test and construct a 95% Wald
confidence interval for $\pi$. Are these sensible?

c. Conduct a score test, reporting the P-value. Construct a 95% score confidence
interval.  Interpret. 

d.  Conduct  a  likelihood-ratio  test  and  construct  a  likelihood-based  95%  confidence 
interval.  Interpret. 

e.  Construct  an  exact  binomial  test.  Interpret. 

1.8  Refer  to  the  previous  exercise.  Suppose  you  wanted  a  large  enough  sample  to  estimate 
the  probability  of  preferring  the  new  drug  to  within  0.05,  with  confidence  0.95.  If 
the  true  probability  is  0.80,  about  how  large  a  sample  is  needed? 

1.9  In  an  experiment  on  chlorophyll  inheritance  in  maize,  for  1103  seedlings  of  self- 
fertilized  heterozygous  green  plants,  854  seedlings  were  green  and  249  were  yellow. 
Theory predicts the ratio of green to yellow is 3:1. Test the hypothesis that 3:1 is the
true  ratio.  Report  the  P-value,  and  interpret. 

1.10  Table  1.3  contains  Ladislaus  von  Bortkiewicz’s  data  on  deaths  of  soldiers  in  the 
Prussian  army  from  kicks  by  army  mules  (Fisher  1934,  Quine  and  Seneta  1987). 
The  data  refer  to  10  army  corps,  each  observed  for  20  years.  In  109  corps-years 
of  exposure,  there  were  no  deaths,  in  65  corps-years  there  was  one  death,  and  so 
on.  Estimate  the  mean  and  test  whether  probabilities  of  occurrences  in  these  five 
categories  follow  a  Poisson  distribution  (truncated  for  4  and  above). 

1.11 A binomial experiment tests $H_0\colon \pi = 0.50$ against $H_a\colon \pi \ne 0.50$ using significance
level 0.05. Only $n = 5$ observations are available. Show that the true null probability
of rejecting $H_0$ is 0.00 for an exact binomial test and 1/16 using the large-sample score
test. 

1.12 A researcher routinely tests using a nominal P(type I error) = 0.05, rejecting $H_0$ if
the  P-value  <  0.05.  An  exact  test  using  test  statistic  T  has  null  distribution  P(T  = 
0)  =  0.30,  P(T  =  1)  =  0.62,  and  P(T  =  2)  =  0.08,  where  a  higher  T  provides 
more  evidence  against  the  null. 


Table 1.3 Data on Deaths by Mule Kicks, for Exercise 1.10

Number of Deaths    Number of Corps-Years
0                   109
1                   65
2                   22
3                   3
4                   1
≥5                  0

a. With the usual P-value, show that the actual P(type I error) = 0.

b. With the mid P-value, show that the actual P(type I error) = 0.08.

c. Find P(type I error) in parts (a) and (b) when $P(T = 0) = 0.30$, $P(T = 1) = 0.66$,
$P(T = 2) = 0.04$. Note that the test with mid P-value can be conservative
or liberal. The exact test with ordinary P-value cannot be liberal.

d. In part (a), a randomized-decision test generates a uniform random variable
$U$ from [0, 1] and rejects $H_0$ if both $T = 2$ and $U \le 5/8$. Show the actual
P(type I error) = 0.05. Is this a sensible test?


1.13  The  2006  General  Social  Survey  asked  respondents  how  much  government  should 
spend  on  culture  and  the  arts,  with  categories  (much  more,  more,  the  same,  less, 
much  less).  For  18-21  year-old  females,  the  counts  in  these  categories  were  (0, 
8,  10,  9,  1).  Find  the  Bayes  estimates  of  the  population  proportions  based  on  a 
Dirichlet prior distribution with $\{\alpha_i = K/5\}$ for values of $K = 1, 2.5, 5$. For each
case,  compare  the  estimate  for  the  “much  more”  category  to  the  ML  estimate. 


1.14 Refer to Example 1.6.4 on estimating the proportion of vegetarians. For the Jeffreys
prior,  find  the  posterior  mean,  the  posterior  95%  equal-tail  interval,  and  the  95% 
highest  posterior  density  interval. 


1.15  You  plan  to  use  Bayesian  methods  to  estimate  binomial  parameters  in  two  cases, 
using  n  observations.  In  case  (1)  you  want  to  estimate  the  probability  that  a  new 
treatment  for  skin  cancer  is  effective.  In  case  (2)  you  want  to  estimate  the  probability 
of  a  head  when  you  repeatedly  flip  a  particular  coin.  Select  prior  distributions  that 
you  think  would  be  sensible  for  each  case.  If  they  differ,  explain  why. 


Theory  and  Methods 

1.16 It is easier to get a precise estimate of the binomial parameter $\pi$ when $\pi$ is near 0 or 1
than when it is near 1/2. Explain why.

1.17 Suppose that $P(Y_i = 1) = 1 - P(Y_i = 0) = \pi$, $i = 1, \ldots, n$, where $\{Y_i\}$ are independent.
Let $Y = \sum_i Y_i$.

a. What is the distribution of $Y$? What are $E(Y)$ and $\mathrm{var}(Y)$?

b. When $\{Y_i\}$ instead have pairwise correlation $\rho > 0$, show that $\mathrm{var}(Y) > n\pi(1 - \pi)$,
overdispersion relative to the binomial. [Altham (1978) and Ochi and Prentice
(1984) discussed generalizations of the binomial that allow correlated trials.]

c. Suppose that heterogeneity exists: $P(Y_i = 1 \mid \pi) = \pi$ for all $i$, but $\pi$ is a random
variable with density function $g(\cdot)$ on [0, 1] having mean $\rho$ and positive variance.
Show that $\mathrm{var}(Y) > n\rho(1 - \rho)$. (When $\pi$ has a beta distribution, $Y$ has the beta-binomial
distribution of Section 14.3.)

1.18 For a sequence of independent Bernoulli trials, let $Y$ be the number of successes
before the $k$th failure. Explain why its probability mass function is the negative
binomial,

$$p(y) = \frac{(y + k - 1)!}{y!\,(k - 1)!}\,\pi^y (1 - \pi)^k, \quad y = 0, 1, 2, \ldots.$$

[For it, $E(Y) = k\pi/(1 - \pi)$ and $\mathrm{var}(Y) = k\pi/(1 - \pi)^2$, so $\mathrm{var}(Y) > E(Y)$; the
Poisson is the limit as $k \to \infty$ and $\pi \to 0$ with $k\pi = \mu$ fixed.]

1.19  For  the  multinomial  distribution,  show  that 

$$\mathrm{corr}(n_j, n_k) = -\pi_j \pi_k \big/ \sqrt{\pi_j(1 - \pi_j)\pi_k(1 - \pi_k)}.$$

When $c = 2$, show that this simplifies to $\mathrm{corr}(n_1, n_2) = -1$, and explain why this
makes  intuitive  sense. 

1.20 Show that the moment generating function (mgf) is (a) $m(t) = (1 - \pi + \pi e^t)^n$ for
the binomial distribution, (b) $m(t) = \exp\{\mu[\exp(t) - 1]\}$ for the Poisson distribution.
For  each  distribution,  use  them  to  obtain  the  first  two  moments  and  to  show  a 
reproductive  property. 

1.21 A likelihood-ratio statistic equals $t_0$. At the ML estimates, show that the data are
$\exp(t_0/2)$ times more likely under $H_a$ than under $H_0$.

1.22 Suppose that $y_1, y_2, \ldots, y_n$ are independent from a Poisson distribution.

a. Obtain the likelihood function. Show that the ML estimator $\hat{\mu} = \bar{y}$.

b. Construct a large-sample test statistic for $H_0\colon \mu = \mu_0$ using (i) the Wald method,
(ii) the score method, and (iii) the likelihood-ratio method.

c. Explain how to construct a large-sample confidence interval for $\mu$ using (i) the
Wald  method,  (ii)  the  score  method,  and  (iii)  the  likelihood-ratio  method. 

1.23  Inference  for  Poisson  parameters  can  often  be  based  on  connections  with  binomial 
and multinomial distributions. Show how to test $H_0\colon \mu_1 = \mu_2$ for two populations
based on independent Poisson counts $(y_1, y_2)$, using a corresponding binomial test.
[Hint: Condition on $n = y_1 + y_2$ and identify $\pi = \mu_1/(\mu_1 + \mu_2)$.] How can you
construct a confidence interval for $\mu_1/\mu_2$ based on one for $\pi$?

1.24 Since the Wald confidence interval for a binomial parameter $\pi$ is degenerate when
$\hat{\pi} = 0$ or 1, argue that the probability that the interval covers $\pi$ cannot exceed
$[1 - \pi^n - (1 - \pi)^n]$; hence, the infimum of the coverage probability over $0 < \pi < 1$
equals 0, regardless of $n$.

1.25 We noted in Section 1.4.2 that the midpoint $\tilde{\pi}$ of the score confidence interval (1.14)
for $\pi$ is the sample proportion after adding $z_{\alpha/2}^2$ observations to the sample, half of
each type. This motivates a simple confidence interval,

$$\tilde{\pi} \pm z_{\alpha/2}\sqrt{\tilde{\pi}(1 - \tilde{\pi})/n^*}, \quad \text{where } n^* = n + z_{\alpha/2}^2.$$

Show that the variance $\tilde{\pi}(1 - \tilde{\pi})/n^*$ at the weighted average is at least as large
as  the  weighted  average  of  the  variances  that  appears  under  the  square  root  sign 
in  the  score  interval.  [Hint:  Use  Jensen’s  inequality.]  Thus,  this  interval,  which  is 
sometimes  referred  to  as  the  Agresti-Coull  confidence  interval,  contains  the  score 
interval.  [Agresti  and  Coull  (1998)  and  Brown  et  al.  (2001)  showed  that  it  performs 
much  better  than  the  Wald  interval.  It  does  not  have  the  score  interval’s  disadvantage 
(Exercise 16.32) of poor coverage near 0 and 1. With 95% confidence, this motivates
a  simple  method  that  uses  the  Wald  method  after  adding  2  observations  of  each  type 
(Agresti  and  Coull  1998,  Agresti  and  Caffo  2000);  this  is  sometimes  called  the  plus 
four  confidence  interval.] 

1.26  A  binomial  sample  of  size  n  has  y  =  0  successes. 

a. Show that the confidence interval for $\pi$ based on the likelihood function is
$[0, 1 - \exp(-z_{\alpha/2}^2/(2n))]$. For $\alpha = 0.05$, use the expansion of an exponential
function to show that this is approximately $[0, 1.92/n]$.

b. For the score method, show that the confidence interval is $[0, z_{\alpha/2}^2/(n + z_{\alpha/2}^2)]$,
or $[0, 3.84/(n + 3.84)]$ when $\alpha = 0.05$. (See Exercise 16.30 for small-sample
intervals  when  y  =  0.) 

1.27 Suppose that $P(T = t_j) = \pi_j$, $j = 1, \ldots$. Show that $E(\text{mid P-value}) = 0.50$.

[Hint: Show that $\sum_j \pi_j(\pi_j/2 + \pi_{j+1} + \cdots) = (\sum_j \pi_j)^2/2$.]

1.28 For a statistic $T$ with cdf $F(t)$ and $p(t) = P(T = t)$, the mid distribution function
is $F_{\mathrm{mid}}(t) = F(t) - 0.50\,p(t)$ (Parzen 1997). Given $T = t_0$, show that the mid
P-value equals $1 - F_{\mathrm{mid}}(t_0)$. (It also satisfies $E[F_{\mathrm{mid}}(T)] = 0.50$ and $\mathrm{var}[F_{\mathrm{mid}}(T)] =
(1/12)\{1 - E[p^2(T)]\}$.)

1.29  Genotypes AA, Aa, and aa occur with probabilities $[\theta^2,\ 2\theta(1-\theta),\ (1-\theta)^2]$.
A multinomial sample of size n has frequencies $(n_1, n_2, n_3)$ of these three
genotypes.

a.  Form the log likelihood. Show that $\hat{\theta} = (2n_1 + n_2)/(2n_1 + 2n_2 + 2n_3)$.

b.  Show that $-\partial^2 L(\theta)/\partial\theta^2 = [(2n_1 + n_2)/\theta^2] + [(n_2 + 2n_3)/(1-\theta)^2]$ and that its
expectation is $2n/[\theta(1-\theta)]$. Use this to obtain an asymptotic standard error of $\hat{\theta}$.

c.  Explain  how  to  test  whether  the  probabilities  truly  have  this  pattern. 

1.30  Refer  to  Section  1.5.6  and  the  model  for  pneumonia  infections  in  calves.  Using  the 
likelihood  function  to  obtain  the  information,  show  that  the  approximate  standard 
error of $\hat{\pi}$ is $\sqrt{\pi(1-\pi)/[n(1+\pi)]}$.

1.31  Refer to Section 1.5.6. Let a denote the number of calves that got a primary, secondary,
and tertiary infection, b the number that received a primary and secondary
but not a tertiary infection, c the number that received a primary but not a secondary
infection, and d the number that did not receive a primary infection. Let $\pi$ be the
probability of a primary infection. Consider the hypothesis that the probability of
infection at time t, given infection at times $1, \ldots, t-1$, is also $\pi$, for t = 2, 3. Show
that $\hat{\pi} = (3a + 2b + c)/(3a + 3b + 2c + d)$.

1.32  Refer to quadratic form (1.18) that leads to the Pearson chi-squared.

a.  Verify that the matrix quoted in the text for $\Sigma_0^{-1}$ is the inverse of $\Sigma_0$.

b.  Show  that  (1.18)  simplifies  to  Pearson’s  statistic  (1.16). 

c.  For the $z_S$ statistic (1.11), show that $z_S^2 = X^2$ for c = 2.

1.33  For testing $H_0$: $\pi_j = \pi_{j0}$, $j = 1, \ldots, c$, using sample multinomial proportions $\{\hat{\pi}_j\}$,
the likelihood-ratio statistic (1.17) is

$$G^2 = -2n \sum_j \hat{\pi}_j \log(\pi_{j0}/\hat{\pi}_j).$$

Show that $G^2 \geq 0$, with equality if and only if $\hat{\pi}_j = \pi_{j0}$ for all j. [Hint: Apply
Jensen’s inequality to $E(-2n \log X)$, where X equals $\pi_{j0}/\hat{\pi}_j$ with probability $\hat{\pi}_j$.]

1.34  For counts $\{n_i\}$ with fitted values $\{\hat{\mu}_i\}$, the power divergence statistic for testing
goodness of fit (Cressie and Read 1984, Read and Cressie 1988) is

$$\frac{2}{\lambda(\lambda+1)} \sum_i n_i\left[(n_i/\hat{\mu}_i)^{\lambda} - 1\right] \quad \text{for } -\infty < \lambda < \infty.$$

a.  For $\lambda = 1$, show that this equals $X^2$.

b.  As $\lambda \to 0$, show that it converges to $G^2$. [Hint: $\log t = \lim_{h \to 0}(t^h - 1)/h$.]

c.  As $\lambda \to -1$, show that it converges to $2\sum_i \hat{\mu}_i \log(\hat{\mu}_i/n_i)$, the minimum
discrimination information statistic (Gokhale and Kullback 1978).

d.  For $\lambda = -2$, show that it equals $\sum_i [(n_i - \hat{\mu}_i)^2/n_i]$, the Neyman modified
chi-squared statistic (Neyman 1949).

e.  For $\lambda = -\frac{1}{2}$, show that it equals $4\sum_i (\sqrt{n_i} - \sqrt{\hat{\mu}_i})^2$, the Freeman-Tukey statistic
(Freeman and Tukey 1950).

[Under regularity conditions, their asymptotic distributions are identical (Drost et al.
1989). The chi-squared null approximation works best for $\lambda$ near $\frac{2}{3}$.]

1.35  The chi-squared mgf with df = $\nu$ is $m(t) = (1 - 2t)^{-\nu/2}$, for $|t| < \frac{1}{2}$. Use it to prove
the  reproductive  property  of  the  chi-squared  distribution. 

1.36  For the multinomial $(n, \{\pi_j\})$ distribution with c > 2, a possible set of score-type
simultaneous confidence limits for $\pi_j$ are the solutions of

$$(\hat{\pi}_j - \pi_j)^2/[\pi_j(1 - \pi_j)/n] = (z_{\alpha/2c})^2, \quad j = 1, \ldots, c.$$

a.  Using the Bonferroni inequality, argue that for large n these c intervals simultaneously
contain all $\{\pi_j\}$ with probability at least $1 - \alpha$.

b.  Show that the standard deviation of $\hat{\pi}_j - \hat{\pi}_k$ is $\{[\pi_j + \pi_k - (\pi_j - \pi_k)^2]/n\}^{1/2}$. Let
$a = c(c - 1)/2$. For large n, explain why the probability is at least $1 - \alpha$ that the
Wald confidence intervals

$$(\hat{\pi}_j - \hat{\pi}_k) \pm z_{\alpha/2a}\{[\hat{\pi}_j + \hat{\pi}_k - (\hat{\pi}_j - \hat{\pi}_k)^2]/n\}^{1/2}$$

simultaneously contain the a differences $\{\pi_j - \pi_k\}$ (Goodman 1965).

1.37  Consider the Bayesian equal-tail posterior interval for a binomial parameter $\pi$, using
a beta or logit-normal prior. When y = 0, explain why the lower limit for $\pi$ can never
be  0,  unlike  the  frequentist  approach  based  on  inverting  a  score  or  likelihood-ratio 
test. 

1.38  Consider estimating the ratio $\pi_i/\pi_j$ of two multinomial parameters. Should the
estimate depend at all on the counts in other categories?

a.  With a frequentist approach, explain why the ML estimate of $\pi_i/\pi_j$ is $y_i/y_j$.

b.  For a Dirichlet prior, show that using the Bayes estimates of $\pi_i$ and $\pi_j$ to estimate
$\pi_i/\pi_j$ uses also the counts in other categories. (However, the posterior distribution
of $\gamma = \pi_i/(\pi_i + \pi_j)$ is the same as its posterior distribution ignoring the other
counts and treating $y_i$ as binomial with sample size $(y_i + y_j)$ and parameter $\gamma$.)

1.39  Given $\pi$, Y has a bin(n, $\pi$) distribution, and $\pi$ has a uniform prior distribution. Show
that the marginal distribution of Y is uniform over $0, 1, \ldots, n$.

1.40  Consider the Bayes estimator of the binomial parameter $\pi$ using a beta prior
distribution.

a.  Show  that  the  ML  estimator  is  a  limit  of  Bayes  estimators,  for  a  certain  sequence 
of  beta  prior  parameter  values. 

b.  Find  an  improper  prior  density  such  that  the  Bayes  estimator  coincides  with  the 
ML  estimator.  (In  this  sense,  the  ML  estimator  is  a  generalized  Bayes  estimator.) 

1.41  For the Dirichlet prior for multinomial probabilities, show the posterior expected
value of $\pi_j$ is formula (1.19). Derive the expression for this Bayes estimator as a
weighted average of $p_j$ and $E(\pi_j)$.


CHAPTER  2 


Describing  Contingency  Tables 


In  this  chapter  we  introduce  parameters  that  summarize  tables  displaying  relationships 
between  categorical  variables.  After  introducing  basic  terminology  and  notation  in  Section 
2.1,  in  Section  2.2  we  introduce  measures  for  comparing  two  groups  on  a  categorical 
response.  The  odds  ratio  has  special  importance,  appearing  as  a  parameter  in  models 
discussed  later.  In  Section  2.3  we  extend  the  scope  by  controlling  for  a  third  variable. 
The  association  can  change  dramatically  under  a  control.  The  chapter’s  primary  focus 
is  binary  variables,  but  in  Section  2.4  we  present  parameters  for  nominal  and  ordinal 
variables. 


2.1  PROBABILITY  STRUCTURE  FOR  CONTINGENCY  TABLES 

Let  X  and  Y  denote  two  categorical  variables,  X  with  I  categories  and  Y  with  J  categories. 
Classifications  of  subjects  on  both  variables  have  IJ  possible  combinations.  When  both 
variables are response variables, we focus on their joint distribution, which also determines
the  marginal  and  conditional  distributions.  When  Y  is  a  response  variable  and  X  is  an 
explanatory  variable,  we  focus  on  the  conditional  distribution  of  Y  and  how  it  changes  as 
the  category  of  X  changes. 


2.1.1  Contingency  Tables 

A  rectangular  table  having  I  rows  for  categories  of  X  and  J  columns  for  categories  of  Y 
displays  the  IJ  possible  combinations  of  outcomes.  The  cells  of  the  table  represent  the  IJ 
possible  outcomes.  When  the  cells  contain  frequency  counts  of  outcomes  for  a  sample,  the 
table is called a contingency table, a term introduced by Karl Pearson (1904). Another
name is cross-classification table. A contingency table with I rows and J columns is called
an I-by-J (denoted by I × J) table.

Table 2.1, a 2 × 3 contingency table, is from a report on the relationship between
aspirin use and heart attacks by the Physicians’ Health Study Research Group at Harvard
Medical School. The Physicians’ Health Study was a 5-year randomized study of whether
regular aspirin intake reduces mortality from cardiovascular disease. Every other day,
physicians participating in the study took either one aspirin tablet or a placebo. The study
was blind: those in the study did not know whether they were taking aspirin or a placebo.
Of the 11,034 physicians taking a placebo, 18 suffered fatal heart attacks over the course
of the study, whereas of the 11,037 taking aspirin, 5 had fatal heart attacks.


Table 2.1  Cross-Classification of Aspirin Use and Myocardial Infarction

                       Myocardial Infarction
             Fatal Attack    Nonfatal Attack    No Attack
Placebo           18               171            10,845
Aspirin            5                99            10,933

Source: Preliminary report: Findings from the aspirin component of
the ongoing Physicians’ Health Study. N. Engl. J. Med. 318: 262-264,
1988.


2.1.2  Joint/Marginal/Conditional  Distributions  for  Contingency  Tables 

In  some  applications,  both  X  and  Y  are  response  variables.  Suppose  subjects  are  randomly 
chosen  from  a  particular  population,  such  as  in  a  sample  survey  employing  simple  random 
sampling. Then, the responses (X, Y) of a randomly chosen subject have a probability
distribution. Let $\pi_{ij}$ denote the probability that (X, Y) occurs in the cell in row i and column
j. The probability distribution $\{\pi_{ij}\}$ is the joint distribution of X and Y. The marginal
distributions are the row and column totals that result from summing the joint probabilities.
We denote these by $\{\pi_{i+}\}$ for the row variable and $\{\pi_{+j}\}$ for the column variable, where
the subscript “+” denotes the sum over that index; that is,

$$\pi_{i+} = \sum_j \pi_{ij} \quad \text{and} \quad \pi_{+j} = \sum_i \pi_{ij}.$$

These satisfy $\sum_i \pi_{i+} = \sum_j \pi_{+j} = \sum_i \sum_j \pi_{ij} = 1.0$. The marginal distributions provide
single-variable information.

In most contingency tables, Table 2.1 being an example, one variable (say, Y) is a
response variable and the other (X) is an explanatory variable. When X is fixed rather than
random, the notion of a joint distribution for X and Y is no longer meaningful. However,
for a fixed category of X, Y has a probability distribution. It is germane to study how this
distribution changes as the category of X changes. Given that a subject is classified in row i
of X, we use $\pi_{j|i}$ to denote the probability of classification in column j of Y, $j = 1, \ldots, J$.
Then, $\sum_j \pi_{j|i} = 1$. The probabilities $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$ form the conditional distribution of Y
at  category  i  of  X.  A  principal  aim  of  many  studies  is  to  compare  conditional  distributions 
of  Y  at  various  levels  of  explanatory  variables. 

When  both  variables  are  response  variables,  descriptions  of  the  association  can  use  their 
joint  distribution,  the  conditional  distribution  of  Y  given  X,  or  the  conditional  distribution 
of X given Y. The conditional distribution of Y given X relates to the joint distribution by

$$\pi_{j|i} = \pi_{ij}/\pi_{i+} \quad \text{for all } i \text{ and } j.$$

Table 2.2  Notation for Joint, Conditional, and Marginal Probabilities

                          Column
Row              1                  2              Total
1           $\pi_{11}$         $\pi_{12}$        $\pi_{1+}$
           ($\pi_{1|1}$)      ($\pi_{2|1}$)        (1.0)
2           $\pi_{21}$         $\pi_{22}$        $\pi_{2+}$
           ($\pi_{1|2}$)      ($\pi_{2|2}$)        (1.0)
Total       $\pi_{+1}$         $\pi_{+2}$           1.0


Table 2.2 displays notation for joint, conditional, and marginal distributions for the 2 × 2
case. Sample distributions use similar notation, with $p$ or $\hat{\pi}$ in place of $\pi$. For instance,
$\{p_{ij}\}$ denotes the sample joint distribution. The cell frequencies are denoted by $\{n_{ij}\}$, and
$n = \sum_i \sum_j n_{ij}$ is the total sample size. Thus,

$$p_{ij} = n_{ij}/n.$$

The sample proportion of times that subjects in row i made response j is $p_{j|i} = p_{ij}/p_{i+} =
n_{ij}/n_{i+}$, where $n_{i+} = n p_{i+} = \sum_j n_{ij}$.
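
As an illustrative sketch (not part of the original text), the sample joint, marginal, and conditional distributions just defined can be computed directly from the counts in Table 2.1. The variable names are our own choices; only the counts come from the table.

```python
# Counts from Table 2.1 (rows: placebo, aspirin; columns: fatal, nonfatal, no attack).
counts = [[18, 171, 10845],
          [5, 99, 10933]]

n = sum(sum(row) for row in counts)                # total sample size n
row_totals = [sum(row) for row in counts]          # n_{i+}
col_totals = [sum(col) for col in zip(*counts)]    # n_{+j}

# Sample joint distribution p_ij = n_ij / n.
joint = [[n_ij / n for n_ij in row] for row in counts]

# Sample marginal distributions p_{i+} and p_{+j}.
row_marginal = [t / n for t in row_totals]
col_marginal = [t / n for t in col_totals]

# Sample conditional distributions p_{j|i} = n_ij / n_{i+}; each row sums to 1.
conditional = [[n_ij / t for n_ij in row] for row, t in zip(counts, row_totals)]

for label, dist in zip(["placebo", "aspirin"], conditional):
    print(label, [round(p, 4) for p in dist])
```

Because aspirin use is the explanatory variable here, the rows of `conditional` are the quantities of main interest; the joint distribution is shown only to illustrate the notation.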

2.1.3  Example:  Sensitivity  and  Specificity  for  Medical  Diagnoses 

Diagnostic  tests  are  used  to  help  detect  certain  medical  conditions.  These  include  the  PSA 
blood  test  for  prostate  cancer  and  imaging  devices  such  as  the  mammogram  for  diagnosing 
breast  cancer  and  X-rays  and  the  MRI  body  scan.  A  diagnostic  test  for  a  condition  is  said  to 
be  positive  if  it  states  that  the  condition  is  present  and  negative  if  it  states  that  the  condition 
is  absent. 

Breast  cancer  is  the  most  common  form  of  cancer  in  women,  affecting  about  10%  at 
some  time  in  their  lives.  For  the  mammogram  diagnostic  test,  the  chance  of  a  correct  test 
result  varies  according  to  the  breast  density  and  the  radiologist’s  level  of  experience.  Let 
X  =  true  disease  status  (i.e.,  whether  a  woman  truly  has  breast  cancer)  and  let  Y  =  diagnosis 
(positive,  negative).  Table  2.3  shows  typically  reported  values  for  conditional  probabilities 
of  Y  given  X. 

With  a  diagnostic  test,  the  two  correct  diagnoses  are  a  positive  outcome  when  the  person 
has  the  disease  and  a  negative  outcome  when  a  person  does  not  have  it.  Given  that  the  person 
has the disease, the conditional probability that the test is positive is called the sensitivity.
Given that the person does not have the disease, the conditional probability that the test is
negative is called the specificity (Yerushalmy 1947). Ideally, these are both very high.


Table 2.3  Estimated Conditional Distributions for Breast Cancer Mammograms

                      Diagnosis of Test
Breast Cancer     Positive     Negative     Total
Yes                 0.86         0.14        1.0
No                  0.12         0.88        1.0

For a 2 × 2 table with the format of Table 2.3, sensitivity is $\pi_{1|1}$ and specificity is $\pi_{2|2}$. In
Table  2.3,  the  estimated  sensitivity  of  mammography  is  0.86.  Of  women  with  breast  cancer, 
86%  are  diagnosed  correctly.  The  estimated  specificity  is  0.88.  Of  women  not  having  breast 
cancer,  88%  are  diagnosed  correctly. 

2.1.4  Independence  of  Categorical  Variables 

Two  categorical  response  variables  are  defined  to  be  independent  if  all  joint  probabilities 
equal  the  product  of  their  marginal  probabilities, 

$$\pi_{ij} = \pi_{i+}\pi_{+j} \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J. \qquad (2.1)$$

When X and Y are independent,

$$\pi_{j|i} = \pi_{ij}/\pi_{i+} = (\pi_{i+}\pi_{+j})/\pi_{i+} = \pi_{+j} \quad \text{for } i = 1, \ldots, I.$$

Each conditional distribution of Y is identical to the marginal distribution of Y.
Thus, two variables are independent when $\{\pi_{j|1} = \cdots = \pi_{j|I}$, for $j = 1, \ldots, J\}$; that is,
the probability of any given column response is the same in each row. When Y is a response
and X is an explanatory variable, this is a more natural way to define independence than
(2.1). Independence is then often referred to as homogeneity of the conditional distributions.
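
The equivalence between the product form (2.1) and identical conditional distributions can be checked numerically. The following is a minimal sketch; the marginal probabilities are hypothetical values chosen for illustration, and `is_independent` is our own helper name rather than anything defined in the text.

```python
# Hypothetical marginal distributions (illustrative values, not from the text).
row_marginal = [0.3, 0.7]         # pi_{i+}
col_marginal = [0.2, 0.5, 0.3]    # pi_{+j}

# Under independence (2.1), pi_ij = pi_{i+} * pi_{+j}.
joint = [[r * c for c in col_marginal] for r in row_marginal]

# Conditional distribution of Y in each row: pi_{j|i} = pi_ij / pi_{i+}.
conditional = [[p / r for p in row] for row, r in zip(joint, row_marginal)]


def is_independent(table, tol=1e-12):
    """Return True when every cell equals the product of its two margins."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return all(abs(table[i][j] - rows[i] * cols[j]) <= tol
               for i in range(len(rows)) for j in range(len(cols)))


print(is_independent(joint))                        # table built from (2.1)
print(is_independent([[0.4, 0.1], [0.1, 0.4]]))     # a table with association
```

In the first table every row of `conditional` reproduces `col_marginal`, which is the homogeneity property described above.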

2.1.5  Poisson,  Binomial,  and  Multinomial  Sampling 

The probability distributions introduced in Section 1.2 extend to cell counts in contingency
tables. For instance, a Poisson sampling model treats cell counts $\{Y_{ij}\}$ as independent
Poisson random variables with parameters $\{\mu_{ij}\}$. The joint probability mass function for
potential outcomes $\{n_{ij}\}$ is then the product of the Poisson probabilities $P(Y_{ij} = n_{ij})$ for
the IJ cells, or

$$\text{Poisson sampling:} \quad \prod_i \prod_j \exp(-\mu_{ij})\,\mu_{ij}^{n_{ij}}/n_{ij}!.$$

When the total sample size n is fixed but the row and column totals are not, a multinomial
sampling model applies. The IJ cells are the possible outcomes. The probability mass
function of the cell counts has the multinomial form

$$\text{multinomial sampling:} \quad [n!/(n_{11}! \cdots n_{IJ}!)] \prod_i \prod_j \pi_{ij}^{n_{ij}}.$$

When observations on a response Y occur separately at each setting of an explanatory
variable X, it is natural to treat row totals as fixed. For simplicity, we then use the notation
$n_i = n_{i+}$. Suppose that the $n_i$ observations on Y at setting i of X are independent, each
with probability distribution $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$. The counts $\{n_{ij}, j = 1, \ldots, J\}$ satisfying
$\sum_j n_{ij} = n_i$ then have multinomial form. When samples at different settings of X are
independent, the joint probability function for the entire data set is the product of the
multinomial functions from the various settings. This sampling scheme is independent
multinomial sampling,

$$\text{independent multinomial sampling:} \quad \prod_i \left[\frac{n_i!}{\prod_j n_{ij}!} \prod_j \pi_{j|i}^{n_{ij}}\right], \qquad (2.2)$$

also called product multinomial sampling. The special case J = 2 is independent binomial
sampling.

Independent multinomial sampling also results under the following conditions: Suppose
that $\{n_{ij}\}$ result from either independent Poisson sampling with means $\{\mu_{ij}\}$ or multinomial
sampling over the IJ cells with probabilities $\{\pi_{ij} = \mu_{ij}/n\}$. When X is an explanatory
variable, it is sensible to perform statistical inference conditional on the totals $\{n_i = \sum_j n_{ij}\}$
even when their values are not fixed by the sampling design. Conditional on $\{n_i\}$, the cell
counts $\{n_{ij}, j = 1, \ldots, J\}$ have the multinomial distribution (2.2) with response probabilities
$\{\pi_{j|i} = \mu_{ij}/\mu_{i+}, j = 1, \ldots, J\}$, and cell counts from different rows are independent.
With this conditioning, we treat the row totals as fixed and analyze the data as if they formed
separate independent samples.

Sometimes both row and column margins are naturally fixed. The appropriate sampling
distribution is then usually the hypergeometric. This case, considered in Section 3.5.1, is
less common.
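
The link between Poisson and multinomial sampling can be illustrated numerically. The sketch below, which is ours and not from the text, conditions independent Poisson cell counts on their total n and compares the result with the multinomial probability having cell probabilities proportional to the Poisson means. The means and counts are hypothetical values, and the function names are illustrative.

```python
from math import exp, factorial, prod


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)


def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` with cell probabilities `probs`."""
    coef = factorial(sum(counts))
    for k in counts:
        coef //= factorial(k)   # exact integer division at every step
    return coef * prod(p**k for p, k in zip(probs, counts))


# Hypothetical Poisson means and observed counts for the four cells of a 2 x 2 table.
mu = [4.0, 6.0, 3.0, 7.0]
counts = [2, 5, 1, 4]

n = sum(counts)
total_mu = sum(mu)

# Joint Poisson probability of the cell counts, then condition on the total:
# the total of independent Poisson counts is Poisson with mean sum(mu).
joint_poisson = prod(poisson_pmf(k, m) for k, m in zip(counts, mu))
conditional_on_n = joint_poisson / poisson_pmf(n, total_mu)

# Multinomial probability with pi_ij proportional to mu_ij.
multinomial = multinomial_pmf(counts, [m / total_mu for m in mu])

print(conditional_on_n, multinomial)   # the two values agree up to rounding
```

The same argument applied within each row, conditioning on the row totals, gives the independent multinomial form (2.2).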


2.1.6  Example:  Seat  Belts  and  Auto  Accident  Injuries 

Researchers  in  the  Massachusetts  Department  of  Transportation  (MassDOT)  plan  to  study 
the  effects  of  cell-phone  use  and  seat-belt  use  on  incidence  and  severity  of  traffic  accidents. 
For  the  relationship  between  seat-belt  use  (yes,  no)  and  outcome  of  an  automobile  accident 
(fatality,  nonfatality)  for  drivers  involved  in  accidents  on  the  Massachusetts  Turnpike,  they 
will  summarize  results  in  the  format  shown  in  Table  2.4.  They  plan  to  catalog  all  accidents 
on  the  turnpike  for  the  next  year,  classifying  each  according  to  these  variables.  The  total 
sample  size  is  then  a  random  variable.  They  might  treat  the  numbers  of  observations  at  the 
four  combinations  of  seat-belt  use  and  outcome  of  crash  as  independent  Poisson  random 
variables with unknown means $\{\mu_{11}, \mu_{12}, \mu_{21}, \mu_{22}\}$.

Suppose,  instead,  that  the  researchers  randomly  sample  200  police  records  of  accidents 
on  the  turnpike  in  the  past  year  and  classify  each  according  to  seat-belt  use  and  outcome 
of  the  accident.  For  this  study,  the  total  sample  size  n  is  fixed.  They  might  then  treat  the 
four  cell  counts  as  a  multinomial  random  variable  with  n  =  200  trials  and  unknown  joint 
probabilities $\{\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}\}$.


Table 2.4  Seat-Belt Use and Results of Automobile Accidents

                      Result of Accident
Seat-Belt Use     Fatality     Nonfatality
Yes
No

Suppose, instead, that police records for accidents involving fatalities were filed separately
from the others. The researchers might instead randomly sample 100 records of
accidents with a fatality and randomly sample 100 records of accidents with no fatality.
This approach fixes the column totals in Table 2.4 at 100. They might then regard each column
of Table 2.4 as an independent binomial sample. Yet another approach, the traditional
experimental  design,  takes  200  subjects  and  randomly  assigns  100  of  them  to  wear  seat 
belts  and  the  other  100  not  to  wear  them;  then  the  200  all  are  forced  to  have  an  accident.  The 
recorded  results  would  then  be  independent  binomial  samples  in  each  row,  with  fixed  row 
totals  of  100  each.  (Obviously,  traditional  designs  common  in  some  experimental  science 
may  not  be  ethical  for  humans,  especially  in  some  medical  research.) 

2.1.7  Example:  Case-Control  Study  of  Cancer  and  Smoking 

Table  2.5  comes  from  one  of  the  first  studies  of  the  link  between  lung  cancer  and  smoking. 
Richard  Doll  and  Austin  Bradford  Hill  investigated  this  with  data  from  20  hospitals 
in  London,  England,  at  a  time  when  many  medical  scientists  thought  that  the  increasing 
rates  of  lung  cancer  in  London  mainly  reflected  increasing  air  pollution,  largely  from  the 
burning  of  coal  (and  thus,  the  frequent  “London  fog”)  before  the  Clean  Air  Act  of  1956. 
In  their  study,  patients  admitted  with  lung  cancer  in  the  preceding  year  were  queried  about 
their  smoking  behavior.  For  each  of  the  709  patients  admitted,  they  recorded  the  smoking 
behavior  of  a  noncancer  patient  at  the  same  hospital  of  the  same  gender  and  within  the  same 
5-year  grouping  on  age.  The  709  cases  in  the  first  column  of  Table  2.5  are  those  having 
lung  cancer  and  the  709  controls  in  the  second  column  are  those  not  having  it.  A  smoker 
was  defined  as  a  person  who  had  smoked  at  least  one  cigarette  a  day  for  at  least  a  year. 

Normally,  whether  lung  cancer  occurs  is  a  response  variable  and  smoking  behavior  is 
an  explanatory  variable.  In  this  study,  however,  the  marginal  distribution  of  lung  cancer  is 
fixed  by  the  sampling  design,  and  the  outcome  measured  is  whether  the  subject  ever  was 
a  smoker.  The  study,  which  uses  a  retrospective  design  to  “look  into  the  past,”  is  called  a 
case-control  study.  Such  studies  are  common  in  health-related  applications.  Often,  the  two 
samples  are  matched,  as  in  this  study.  Sometimes  the  samples  of  cases  and  controls  are 
independent  rather  than  matched.  For  instance,  another  early  case-control  study  on  lung 
cancer  and  smoking  sampled  subjects  by  sending  letters  to  the  estates  of  physicians  who 
had  died  of  some  type  of  cancer  in  1950  or  1951,  and  observations  were  cross-classified  on 
type  of  cancer  and  the  subject’s  smoking  behavior  (Cornfield  1956). 

We might want to compare smokers with nonsmokers in terms of the proportion who
suffered lung cancer. These proportions refer to the conditional distribution of lung cancer,
given smoking behavior. Instead, case-control studies provide proportions in the reverse
direction, for the conditional distribution of smoking behavior, given lung cancer status. For
those in Table 2.5 with lung cancer, the proportion who were smokers was 688/709 = 0.970,
while it was 650/709 = 0.917 for the controls.


Table 2.5  Cross-Classification of Smoking by Lung Cancer

                Lung Cancer
Smoker      Cases     Controls
Yes          688        650
No            21         59
Total        709        709

Source: Based on data reported in Table IV, R. Doll and
A. B. Hill, Br. Med. J., 739-748, Sept. 30, 1950.

When  we  know  the  proportion  of  the  population  having  lung  cancer,  we  can  use  Bayes’ 
theorem  to  compute  sample  conditional  distributions  in  the  direction  of  main  interest 
(Exercise  2.25).  Otherwise,  using  a  retrospective  sample,  we  cannot  estimate  the  probability 
of  lung  cancer  at  each  category  of  smoking  behavior.  For  Table  2.5  we  do  not  know  the 
population  prevalence  of  lung  cancer,  and  the  patients  suffering  it  were  probably  sampled 
at  a  rate  far  in  excess  of  their  occurrence  in  the  general  population. 
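
To make the role of Bayes’ theorem concrete, the following sketch converts the two retrospective proportions from Table 2.5 into probabilities of lung cancer given smoking status. The prevalence used is a purely hypothetical value chosen for illustration, since the text stresses that the true value is unknown for these data. The sketch also assumes that the controls represent the population without lung cancer, which need not hold for hospital controls. The function name is our own.

```python
def prob_cancer_given_smoking(p_smoke_given_case, p_smoke_given_control, prevalence):
    """Bayes' theorem: P(cancer | smoking status) from the reverse conditionals."""
    numerator = prevalence * p_smoke_given_case
    denominator = numerator + (1 - prevalence) * p_smoke_given_control
    return numerator / denominator


# Sample conditional proportions of smokers from Table 2.5.
p_smoker_case = 688 / 709
p_smoker_control = 650 / 709

# ASSUMPTION: hypothetical population prevalence of lung cancer, for illustration
# only; the text gives no such value for Table 2.5.
prevalence = 0.001

risk_smokers = prob_cancer_given_smoking(p_smoker_case, p_smoker_control, prevalence)
risk_nonsmokers = prob_cancer_given_smoking(1 - p_smoker_case, 1 - p_smoker_control,
                                            prevalence)

print(f"P(cancer | smoker)    = {risk_smokers:.5f}")
print(f"P(cancer | nonsmoker) = {risk_nonsmokers:.5f}")
```

Changing `prevalence` rescales both probabilities, which is why a retrospective sample alone cannot estimate them.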

2.1.8  Types  of  Studies:  Observational  Versus  Experimental 

By  contrast  to  the  case-control  study  just  described,  imagine  a  study  that  samples  subjects 
from  the  population  of  teenagers  and  then  60  years  later  measures  the  rates  of  lung  cancer 
for  the  smokers  and  nonsmokers.  Such  a  sampling  design  is  prospective.  There  are  two  types 
of  prospective  studies.  Clinical  trials  randomly  allocate  subjects  to  the  groups  who  will  be 
smokers and nonsmokers. In cohort studies, subjects make their own choice about whether
to smoke, and the study observes in future time who develops lung cancer. Yet another
approach, a cross-sectional design, samples subjects and classifies them simultaneously on
both  variables. 

Prospective studies usually condition on the totals $\{n_i = \sum_j n_{ij}\}$ for categories of X
and regard each row of J counts as an independent multinomial sample on Y. Retrospective
studies treat the totals $\{n_{+j}\}$ for Y as fixed and regard each column of I counts as a
multinomial sample on X. In cross-sectional studies, the total sample size is fixed but not
the  row  or  column  totals,  and  the  IJ  cell  counts  are  a  multinomial  sample. 

A clinical trial is an experimental study, the investigator having the advantage of experimental
control over which subjects receive each treatment. Such studies can use the power
of  randomization  to  make  the  groups  balance  (apart  from  sampling  error)  on  other  variables 
that  may  be  associated  with  the  response.  This  lowers  the  chance  that  an  association  may 
be  due  to  some  unobserved  variable.  By  contrast,  case-control,  cohort,  and  cross-sectional 
studies  are  observational  studies.  They  merely  observe  who  chooses  each  group  and  who 
has  the  outcome  of  interest.  Observational  studies  have  more  potential  for  biases  of  various 
types,  and  it  is  dangerous  to  conclude  that  an  association  reflects  a  causal  connection. 

For  example,  suppose  an  observational  study  finds  that  people  who  are  unmarried  are 
more  likely  to  be  a  member  of  Facebook  than  those  who  are  married.  Many  variables  are 
associated  both  with  marital  status  and  with  whether  a  person  is  a  member  of  Facebook. 
Such  variables  could  account  for  the  association.  One  such  variable  could  be  a  person’s  age. 
Perhaps  younger  people  are  both  more  likely  to  be  a  member  of  Facebook  and  more  likely 
to  be  unmarried.  If  the  study  failed  to  measure  age  or  control  for  it  adequately,  it  might 
misleadingly  predict  a  causal  relation  between  marital  status  and  Facebook  membership. 


2.2  COMPARING  TWO  PROPORTIONS 

Many  studies  are  designed  to  compare  groups  on  a  binary  response  variable.  Then  Y  has 
only  two  categories,  such  as  (success,  failure)  for  outcome  of  a  medical  treatment.  With 
two  groups,  a  2  x  2  contingency  table  displays  the  results.  The  rows  are  the  groups  and 
the columns are the categories of Y. This section presents parameters for comparing the
groups. 


2.2.1  Difference  of  Proportions 

For subjects in row i, $\pi_{1|i}$ is the probability that the response has outcome in category
1 (“success”). With only two possible outcomes, $\pi_{2|i} = 1 - \pi_{1|i}$, and we use the simpler
notation $\pi_i$ for $\pi_{1|i}$. The difference of proportions of successes, $\pi_1 - \pi_2$, is a basic comparison
of the two rows. Comparison on failures is equivalent to comparison on successes,
since

$$(1 - \pi_1) - (1 - \pi_2) = \pi_2 - \pi_1.$$

The difference of proportions falls between -1.0 and +1.0. It equals zero when the rows
have identical conditional distributions. The response Y is independent of the row classification
when $\pi_1 - \pi_2 = 0$.

When both variables are responses, conditional distributions apply in either direction.
We can also compare the two columns, such as by the difference between the proportions
in row 1. This usually is not equal to the difference $\pi_1 - \pi_2$ comparing the rows, unless
$\pi_1 - \pi_2 = 0$.

2.2.2  Relative  Risk 

A value $\pi_1 - \pi_2$ of fixed size may have greater importance when both $\pi_i$ are close to 0 or 1
than when they are not. For a study comparing two treatments on the proportion of subjects
who die, the difference between 0.010 and 0.001 is more noteworthy than the difference
between 0.410 and 0.401, even though both are 0.009. In such cases, the ratio of proportions
is also informative.

The relative risk is defined to be the ratio of probabilities,

$$\text{relative risk} = \pi_1/\pi_2. \qquad (2.3)$$

It can be any nonnegative real number. A relative risk of 1.0 corresponds to independence.
For the proportions just given, the relative risks are 0.010/0.001 = 10.0 and 0.410/0.401 =
1.02. Comparing the rows on the second response category gives a different relative risk,
$(1 - \pi_1)/(1 - \pi_2)$.
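
The contrast between the two measures can be reproduced with a few lines of code. This is a small illustrative sketch using the two pairs of proportions quoted above; the function names are our own.

```python
def difference_of_proportions(p1, p2):
    """pi_1 - pi_2, comparing the success probabilities of the two rows."""
    return p1 - p2


def relative_risk(p1, p2):
    """pi_1 / pi_2, as in (2.3)."""
    return p1 / p2


for p1, p2 in [(0.010, 0.001), (0.410, 0.401)]:
    diff = difference_of_proportions(p1, p2)
    rr_success = relative_risk(p1, p2)
    rr_failure = relative_risk(1 - p1, 1 - p2)   # relative risk on the other category
    print(f"pi1={p1:.3f}, pi2={p2:.3f}: difference={diff:.3f}, "
          f"relative risk={rr_success:.2f}, failure relative risk={rr_failure:.3f}")
```

Both pairs share the same difference, yet their relative risks differ by an order of magnitude, and the relative risk computed on failures is close to 1 in both cases.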


2.2.3  Odds  Ratio 

For a probability $\pi$ of success, the odds are defined to be

$$\text{odds:} \quad \Omega = \pi/(1 - \pi).$$

The odds are nonnegative, with $\Omega > 1.0$ when a success is more likely than a failure. When
$\pi = 0.75$, for instance, then $\Omega = 0.75/0.25 = 3.0$; a success is three times as likely as a
failure, and we expect about three successes for every one failure. When $\Omega = \frac{1}{3}$, a failure
is three times as likely as a success. Inversely,

$$\pi = \Omega/(\Omega + 1).$$

For instance, when the odds $\Omega = \frac{1}{3}$, then the probability $\pi = 0.25$.

Refer again to a 2 × 2 table. Within row i, the odds of success instead of failure are
$\Omega_i = \pi_i/(1 - \pi_i)$. The ratio of the odds $\Omega_1$ and $\Omega_2$ in the two rows,

$$\theta = \frac{\Omega_1}{\Omega_2} = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}, \qquad (2.4)$$

is called the odds ratio.

For joint distributions with cell probabilities $\{\pi_{ij}\}$, the equivalent definition for the odds
in row i is $\Omega_i = \pi_{i1}/\pi_{i2}$, i = 1, 2. Then the odds ratio is

$$\theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}. \qquad (2.5)$$

An alternative name for $\theta$ is the cross-product ratio, because it equals the ratio of the
products $\pi_{11}\pi_{22}$ and $\pi_{12}\pi_{21}$ of probabilities from diagonally opposite cells (Yule 1900,
1912).


2.2.4  Properties  of  the  Odds  Ratio 

The odds ratio can equal any nonnegative number. The condition $\Omega_1 = \Omega_2$ and hence (when
all cell probabilities are positive) $\theta = 1$ corresponds to independence of X and Y. When
$1 < \theta < \infty$, subjects in row 1 are more likely to have a success than are subjects in row 2;
that is, $\pi_1 > \pi_2$. For instance, when $\theta = 4$, the odds of success in row 1 are four times the
odds in row 2. This does not mean that the probability $\pi_1 = 4\pi_2$; that is the interpretation
of a relative risk of 4.0. When $0 < \theta < 1$, then $\pi_1 < \pi_2$.

Values of $\theta$ farther from 1.0 in a given direction represent stronger association. Two
values represent the same association, but in opposite directions, when one is the reciprocal
of the other. For instance, when $\theta = 0.25$, the odds of success in row 1 are 0.25 times the
odds in row 2, or equivalently, the odds of success in row 2 are 1/0.25 = 4.0 times the odds
in row 1. When the order of the rows is reversed or the order of the columns is reversed,
the new value for $\theta$ is the reciprocal of the original value.

For inference, we shall see it is sometimes convenient to use $\log\theta$. Independence
corresponds to $\log\theta = 0$. The log odds ratio is symmetric about this value: reversal of
rows or of columns results in a change in its sign. Two values for $\log\theta$ that are the same
except for sign, such as log 4 = 1.39 and log 0.25 = -1.39, represent the same strength of
association.
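
The reciprocal and sign-change properties are easy to confirm numerically. The sketch below uses a hypothetical 2 × 2 table of counts (not data from the text) and a helper `odds_ratio` that we introduce for illustration.

```python
import math


def odds_ratio(table):
    """Cross-product ratio of a 2 x 2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


table = [[40, 10],
         [20, 30]]                             # hypothetical counts
rows_reversed = table[::-1]                    # swap the two rows
cols_reversed = [row[::-1] for row in table]   # swap the two columns

theta = odds_ratio(table)
print(theta, odds_ratio(rows_reversed), odds_ratio(cols_reversed))
print(math.log(theta), math.log(odds_ratio(rows_reversed)))
```

Reversing either the rows or the columns replaces the odds ratio by its reciprocal, so the log odds ratio keeps its magnitude and changes sign.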

The odds ratio does not change value when the orientation of the table reverses so that
the rows become the columns and the columns become the rows. This is clear from the
symmetric form of (2.5). It is unnecessary to identify one classification as the response
variable in order to use $\theta$. In fact, although (2.4) defined the odds ratio in terms of odds
using $\pi_i = P(Y = 1 \mid X = i)$, we could just as well define it using reverse conditional
probabilities. With a joint distribution, conditional distributions exist in each direction, and

$$\theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \frac{P(Y = 1 \mid X = 1)/P(Y = 2 \mid X = 1)}{P(Y = 1 \mid X = 2)/P(Y = 2 \mid X = 2)} = \frac{P(X = 1 \mid Y = 1)/P(X = 2 \mid Y = 1)}{P(X = 1 \mid Y = 2)/P(X = 2 \mid Y = 2)}. \tag{2.6}$$

Because  of  this,  the  odds  ratio  is  equally  valid  for  prospective,  retrospective,  or  cross- 
sectional  sampling  designs.  The  sample  odds  ratio  estimates  the  same  parameter  in  each 
case. 
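
To see numerically that the odds ratio is the same whichever conditional distributions are used, the following sketch starts from a hypothetical joint distribution (our own numbers, chosen for illustration) and computes the odds ratio three ways: from the joint probabilities, from P(Y | X), and from P(X | Y).

```python
import math

# Hypothetical joint probabilities pi[i][j] = P(X = i + 1, Y = j + 1).
pi = [[0.40, 0.10],
      [0.20, 0.30]]

row_tot = [sum(row) for row in pi]
col_tot = [pi[0][j] + pi[1][j] for j in range(2)]

# p_y_given_x[i][j] = P(Y = j + 1 | X = i + 1); p_x_given_y[i][j] = P(X = i + 1 | Y = j + 1).
p_y_given_x = [[pi[i][j] / row_tot[i] for j in range(2)] for i in range(2)]
p_x_given_y = [[pi[i][j] / col_tot[j] for j in range(2)] for i in range(2)]

theta_joint = pi[0][0] * pi[1][1] / (pi[0][1] * pi[1][0])
theta_forward = ((p_y_given_x[0][0] / p_y_given_x[0][1])
                 / (p_y_given_x[1][0] / p_y_given_x[1][1]))
theta_reverse = ((p_x_given_y[0][0] / p_x_given_y[1][0])
                 / (p_x_given_y[0][1] / p_x_given_y[1][1]))

print(theta_joint, theta_forward, theta_reverse)
```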

For cell counts $\{n_{ij}\}$, the sample odds ratio is

$$\hat\theta = (n_{11} n_{22})/(n_{12} n_{21}).$$


This  does  not  change  when  both  cell  counts  within  any  row  are  multiplied  by  a  nonzero 
constant  or  when  both  cell  counts  within  any  column  are  multiplied  by  a  nonzero  constant. 
An implication is that the sample odds ratio estimates the same characteristic ($\theta$) even when
the  sample  is  disproportionately  large  or  small  from  marginal  categories  of  a  variable.  For  a 
case-control  study  of  the  association  between  vaccination  and  catching  the  flu,  the  sample 
odds  ratio  estimates  the  same  characteristic  with  a  random  sample  of  (1)  100  people  who 
got  the  flu  and  100  people  who  did  not,  or  (2)  40  people  who  got  the  flu  and  160  people 
who  did  not.  The  sample  versions  of  the  difference  of  proportions  and  relative  risk  (2.3) 
are  invariant  to  multiplication  of  counts  within  rows  by  a  constant,  but  they  change  with 
multiplication  within  columns  or  with  row-column  interchange. 
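
The invariance claims can be checked on a small example. The counts below are hypothetical, and the helper functions are ours; the sketch multiplies one row and then one column by a constant and recomputes the sample odds ratio and sample relative risk.

```python
import math


def sample_odds_ratio(n):
    (a, b), (c, d) = n
    return (a * d) / (b * c)


def sample_relative_risk(n):
    """Ratio of the row proportions falling in the first column."""
    (a, b), (c, d) = n
    return (a / (a + b)) / (c / (c + d))


def scale_row(n, i, k):
    return [[x * k for x in row] if r == i else list(row) for r, row in enumerate(n)]


def scale_col(n, j, k):
    return [[x * k if c == j else x for c, x in enumerate(row)] for row in n]


counts = [[30, 70],
          [10, 90]]                        # hypothetical cell counts
row_scaled = scale_row(counts, 0, 4)       # first row multiplied by 4
col_scaled = scale_col(counts, 0, 4)       # first column multiplied by 4

for label, n in [("original", counts), ("row x 4", row_scaled), ("column x 4", col_scaled)]:
    print(f"{label:11s} odds ratio={sample_odds_ratio(n):.3f}  "
          f"relative risk={sample_relative_risk(n):.3f}")
```

The odds ratio is unchanged in all three tables, whereas the relative risk survives the row scaling but not the column scaling.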

2.2.5  Example:  Association  Between  Heart  Attacks  and  Aspirin  Use 

We illustrate the three association measures with Table 2.1 on aspirin use and heart attacks.
The  table  differentiates  between  fatal  and  nonfatal  heart  attacks,  but  we  combine  these 
outcomes  for  now. 

Of the 11,034 physicians taking placebo, 189 suffered heart attacks, a proportion of
189/11,034 = 0.0171. Of the 11,037 taking aspirin, 104 had heart attacks, a proportion of
0.0094. The sample difference of proportions is 0.0171 - 0.0094 = 0.0077. The sample
relative risk is 0.0171/0.0094 = 1.82. The proportion suffering heart attacks of those taking
placebo was 1.82 times the proportion suffering heart attacks of those taking aspirin. The
sample odds ratio is (189 × 10,933)/(10,845 × 104) = 1.83. The odds of heart attack for
those taking placebo were 1.83 times the odds for those taking aspirin.
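
These three values can be reproduced directly from the counts quoted above. The short Python sketch below is only a check of the arithmetic; the variable names are ours.

```python
# Counts quoted in the text: heart attack (yes, no) for each group.
placebo_yes, placebo_no = 189, 10845
aspirin_yes, aspirin_no = 104, 10933

p_placebo = placebo_yes / (placebo_yes + placebo_no)
p_aspirin = aspirin_yes / (aspirin_yes + aspirin_no)

difference = p_placebo - p_aspirin
relative_risk = p_placebo / p_aspirin
odds_ratio = (placebo_yes * aspirin_no) / (placebo_no * aspirin_yes)

print(f"difference of proportions = {difference:.4f}")
print(f"relative risk             = {relative_risk:.2f}")
print(f"odds ratio                = {odds_ratio:.2f}")
```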

2.2.6  Case-Control  Studies  and  the  Odds  Ratio 

With retrospective sampling designs, such as case-control studies, it is possible to estimate
conditional probabilities of form $P(X = i \mid Y = j)$. It is usually not possible to estimate the
probability $P(Y = j \mid X = i)$ of an outcome of interest or the difference of proportions or
relative risk for that outcome. It is possible to estimate the odds ratio, however, since by
(2.6) it is determined by conditional probabilities in either direction.

To illustrate, we revisit Table 2.5 on X = smoking behavior and Y = lung cancer. The
data were two binomial samples on X at fixed levels of Y. Thus, we can estimate the
probability a subject was a smoker, given the outcome on whether the subject had lung
cancer; this was 688/709 for the cases and 650/709 for the controls. We cannot estimate
the probability of lung cancer, given whether one smoked, which is more relevant. Thus,
we cannot estimate differences or ratios of probabilities of lung cancer. The difference
of proportions and relative risk are limited to comparisons of the probabilities of being a
smoker. However, we can compute the odds ratio using the sample analog of (2.6),

$$\frac{(688/709)/(21/709)}{(650/709)/(59/709)} = \frac{688 \times 59}{650 \times 21} = 3.0.$$

Moreover,  by  (2.6),  interpretations  can  use  the  direction  of  interest,  even  though  the  study 
was  retrospective:  The  estimated  odds  of  lung  cancer  for  smokers  were  3.0  times  the 
estimated  odds  for  nonsmokers. 
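
The calculation can be sketched as follows, using the counts quoted above (688 smokers among 709 cases, 650 smokers among 709 controls). The sketch computes the odds of smoking in each group, which a retrospective design can estimate, and confirms that their ratio is the cross-product ratio.

```python
import math

cases = {"smoker": 688, "nonsmoker": 21}       # lung cancer cases
controls = {"smoker": 650, "nonsmoker": 59}    # controls

# Odds of being a smoker within each outcome group.
odds_smoker_cases = cases["smoker"] / cases["nonsmoker"]
odds_smoker_controls = controls["smoker"] / controls["nonsmoker"]

odds_ratio = odds_smoker_cases / odds_smoker_controls
cross_product = (cases["smoker"] * controls["nonsmoker"]) / (
    controls["smoker"] * cases["nonsmoker"])

print(f"odds ratio = {odds_ratio:.2f}")
```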

2.2.7  Relationship  Between  Odds  Ratio  and  Relative  Risk 

From definitions (2.3) and (2.4),

$$\text{odds ratio} = \text{relative risk} \times \left(\frac{1 - \pi_2}{1 - \pi_1}\right).$$

Their magnitudes are similar whenever the probability $\pi_i$ of the outcome of interest is close
to zero for both groups. We saw this similarity in Section 2.2.5 for the aspirin study, where
the heart attack proportion was less than 0.02 for each group. The relative risk was 1.82
and the odds ratio was 1.83.
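
A brief numerical sketch of this relationship follows. It uses the aspirin-study proportions from Section 2.2.5 and, for contrast, a pair of large probabilities (0.8 and 0.4) that we made up to show how far apart the two measures can be when the outcome is common.

```python
import math


def compare(p1, p2):
    """Return (relative risk, odds ratio, relative risk x (1 - p2)/(1 - p1))."""
    rr = p1 / p2
    theta = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return rr, theta, rr * (1 - p2) / (1 - p1)


rare = compare(189 / 11034, 104 / 11037)   # aspirin study: both proportions below 0.02
common = compare(0.8, 0.4)                 # hypothetical common outcome

print("rare outcome:   RR=%.2f  OR=%.2f" % rare[:2])
print("common outcome: RR=%.2f  OR=%.2f" % common[:2])
```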

Because of this similarity, when each $\pi_i$ is small, the odds ratio provides a rough
indication  of  the  relative  risk  when  it  is  not  directly  estimable,  such  as  in  case-control 
studies  (Cornfield  1951).  For  instance,  for  Table  2.5,  if  the  probability  of  lung  cancer  is 
small  regardless  of  smoking  behavior,  3.0  is  also  a  rough  estimate  of  the  relative  risk;  that 
is,  for  the  way  smoking  was  defined  in  that  study,  smokers  had  about  3.0  times  the  chance 
of  lung  cancer  as  nonsmokers. 


2.3  CONDITIONAL  ASSOCIATION  IN  STRATIFIED  2x2  TABLES 

An  important  part  of  any  observational  study  is  the  choice  of  control  variables.  In  studying 
the effect of X on Y, we should attempt to adjust or “control” any covariate that can
influence that relationship. This involves using some mechanism to hold the covariate
constant. Otherwise, an observed effect of X on Y may actually reflect effects of that
covariate on both X and Y. The relationship between X and Y then shows confounding.
Experimental studies can remove effects of confounding covariates by randomly assigning
subjects to different levels of X, but this is not possible with observational studies.

Suppose  that  a  study  considers  effects  of  passive  smoking,  the  effects  on  a  nonsmoker  of 
living  with  a  smoker.  To  analyze  whether  passive  smoking  is  associated  with  lung  cancer,  a 
cross-sectional  study  might  compare  lung  cancer  rates  between  nonsmokers  whose  spouses 
smoke  and  nonsmokers  whose  spouses  do  not  smoke.  The  study  should  attempt  to  control 
for  age,  socioeconomic  status,  and  other  variables  that  might  relate  both  to  spouse  smoking 
and  to  developing  lung  cancer.  Otherwise,  results  will  have  limited  usefulness.  Spouses  of 
nonsmokers may tend to be younger than spouses of smokers, and younger people are less
likely  to  have  lung  cancer.  Then  a  lower  proportion  of  lung  cancer  cases  among  spouses  of 
nonsmokers  may  merely  reflect  their  lower  average  age. 

In  this  section  we  discuss  the  analysis  of  the  association  between  categorical  variables  X 
and  Y  while  controlling  for  a  possibly  confounding  variable  Z.  For  simplicity,  the  examples 
refer  to  a  single  control  variable.  In  later  chapters  we  treat  more  general  cases  and  use 
models  to  perform  statistical  control. 

2.3.1  Partial  Tables 

A three-way contingency table cross-classifies X, Y, and Z. We control for Z by studying
the  XY  relationship  at  fixed  levels  of  Z.  Two-way  cross-sectional  slices  of  the  three-way 
table  cross-classify  X  and  Y  at  separate  categories  of  Z.  These  cross  sections  are  called 
partial  tables.  They  display  the  XY  relationship  while  removing  the  effect  of  Z  by  holding 
its  value  constant. 

The  two-way  contingency  table  obtained  by  combining  the  partial  tables  is  called  the 
XY  marginal  table.  Each  cell  count  in  the  marginal  table  is  a  sum  of  counts  from  the  same 
location  in  the  partial  tables.  The  marginal  table,  rather  than  controlling  Z,  ignores  it.  The 
marginal  table  contains  no  information  about  Z.  It  is  simply  a  two-way  table  relating  X  and 
Y but may reflect the effects of Z on X and Y.

The  associations  in  partial  tables  are  called  conditional  associations ,  because  they  refer 
to  the  association  between  X  and  Y  conditional  on  fixing  Z  at  some  level.  Conditional 
associations  in  partial  tables  can  be  quite  different  from  associations  in  marginal  tables.  In 
fact,  it  can  be  misleading  to  analyze  only  marginal  tables  of  a  multiway  contingency  table. 
The  following  example  illustrates. 

2.3.2  Example:  Racial  Characteristics  and  the  Death  Penalty 

Table 2.6 is a 2 × 2 × 2 contingency table (two rows, two columns, and two layers) from
an article that studied effects of racial characteristics on whether persons convicted of
homicide received the death penalty. The 674 subjects classified in Table 2.6 were the
defendants in indictments involving cases with multiple murders in Florida between 1976
and 1987. The variables in Table 2.6 are Y = death penalty verdict, having the categories
(yes, no), X = race of defendant, and Z = race of victims, each having the categories
(white, black). We study the effect of defendant’s race on the death penalty verdict, treating
victims’ race as a control variable. Table 2.6 has a 2 × 2 partial table relating defendant’s
race and the death penalty verdict at each category of victims’ race.


Table 2.6 Death Penalty Verdict by Defendant’s Race and Victims’ Race

                                       Death Penalty
Victims’ Race    Defendant’s Race     Yes       No     Percent Yes
White            White                 53      414        11.3
                 Black                 11       37        22.9
Black            White                  0       16         0.0
                 Black                  4      139         2.8
Total            White                 53      430        11.0
                 Black                 15      176         7.9

Source: M. L. Radelet and G. L. Pierce, Florida Law Rev. 43: 1-34, 1991. Reprinted with
permission from the Florida Law Review.

Figure 2.1 Percentage receiving death penalty, by defendant’s race and victims’ race.

For each combination of defendant’s race and victims’ race, Table 2.6 lists and
Figure 2.1 displays the percentage of defendants who received the death penalty. These
describe the conditional associations. When the victims were white, the death penalty was
imposed 22.9% - 11.3% = 11.6% more often for black defendants than for white defendants.
When the victims were black, the death penalty was imposed 2.8% more often for
black defendants than for white defendants. Controlling for victims’ race by keeping it fixed,
the death penalty was imposed more often on black defendants than on white defendants.

The  bottom  portion  of  Table  2.6  displays  the  marginal  table.  It  results  from  summing 
the  cell  counts  in  Table  2.6  over  the  two  categories  of  victims’  race,  thus  combining  the 
two partial tables (e.g., 11 + 4 = 15). Overall, 11.0% of white defendants and 7.9% of
black  defendants  received  the  death  penalty.  Ignoring  victims’  race,  the  death  penalty  was 
imposed  less  often  on  black  defendants  than  on  white  defendants.  The  association  reverses 
direction  compared  with  the  partial  tables. 
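
The percentages in the partial tables and in the marginal table can be recomputed from the counts in Table 2.6. The sketch below is one way to organize that calculation; the dictionary layout and names are ours.

```python
# counts[(victims' race, defendant's race)] = (death penalty yes, death penalty no)
counts = {
    ("white", "white"): (53, 414),
    ("white", "black"): (11, 37),
    ("black", "white"): (0, 16),
    ("black", "black"): (4, 139),
}


def percent_yes(yes, no):
    return 100 * yes / (yes + no)


# Conditional (partial-table) percentages, one per combination.
partial = {key: percent_yes(*cell) for key, cell in counts.items()}

# Marginal percentages: sum the partial tables over victims' race.
marginal = {}
for defendant in ("white", "black"):
    yes = sum(counts[(victim, defendant)][0] for victim in ("white", "black"))
    no = sum(counts[(victim, defendant)][1] for victim in ("white", "black"))
    marginal[defendant] = percent_yes(yes, no)

for (victim, defendant), pct in partial.items():
    print(f"victims {victim:5s} defendant {defendant:5s}: {pct:4.1f}% yes")
for defendant, pct in marginal.items():
    print(f"ignoring victims' race, defendant {defendant:5s}: {pct:4.1f}% yes")
```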

Why  does  the  association  change  so  much  when  we  ignore  versus  control  victims’  race? 
This  relates  to  the  nature  of  the  association  between  victims’  race  and  each  of  the  other 
variables.  First,  the  association  between  victims’  race  and  defendant’s  race  is  extremely 
strong.  The  marginal  table  relating  these  variables  has  odds  ratio  (467  x  143)/(48  x  16)  = 
87.0.  Second,  Table  2.6  shows  that,  regardless  of  defendant’s  race,  the  death  penalty  was 
much  more  likely  when  the  victims  were  white  than  when  the  victims  were  black.  So 
whites  are  tending  to  kill  whites,  and  killing  whites  is  more  likely  to  result  in  the  death 
penalty.  This  suggests  that  the  marginal  association  should  show  a  greater  tendency  than 
the  conditional  associations  for  white  defendants  to  receive  the  death  penalty.  In  fact, 
Table  2.6  has  this  pattern. 
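
Both ingredients of this explanation can be verified from Table 2.6. The following sketch (names ours) collapses the table over the verdict to obtain the odds ratio between victims’ race and defendant’s race, and collapses over defendant’s race to compare death penalty rates by victims’ race.

```python
# counts[(victims' race, defendant's race)] = (death penalty yes, death penalty no)
counts = {
    ("white", "white"): (53, 414),
    ("white", "black"): (11, 37),
    ("black", "white"): (0, 16),
    ("black", "black"): (4, 139),
}

# Marginal table of victims' race by defendant's race (summing over the verdict).
n = {key: yes + no for key, (yes, no) in counts.items()}
theta_victim_defendant = (n[("white", "white")] * n[("black", "black")]) / (
    n[("white", "black")] * n[("black", "white")])

# Death penalty rate by victims' race, ignoring defendant's race.
rate_by_victim = {}
for victim in ("white", "black"):
    yes = sum(counts[(victim, d)][0] for d in ("white", "black"))
    total = sum(n[(victim, d)] for d in ("white", "black"))
    rate_by_victim[victim] = yes / total

print(f"victims-defendant odds ratio = {theta_victim_defendant:.1f}")
print({k: round(v, 3) for k, v in rate_by_victim.items()})
```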

Figure 2.2 illustrates why the marginal association differs so from the conditional associations.
For each defendant’s race, the figure plots the proportion receiving the death
penalty at each category of victims’ race. Each proportion is labeled by a letter symbol
giving the category of victims’ race. Surrounding each observation is a circle having area
proportional to the number of observations at that combination of defendant’s race and
victims’ race. For instance, the W in the largest circle represents a proportion of 0.113
receiving the death penalty for cases with white defendants and white victims. That circle
is largest because the number of cases at that combination (53 + 414 = 467) is largest. The
next-largest circle relates to cases in which blacks kill blacks.

Figure 2.2 Proportion receiving death penalty by defendant’s race, controlling and ignoring victims’ race.

We  control  for  victims’  race  by  comparing  circles  having  the  same  victims’  race  letter 
at  their  centers.  The  line  connecting  the  two  W  circles  has  a  positive  slope,  as  does  the  line 
connecting  the  two  B  circles.  Controlling  for  victims’  race,  this  reflects  the  death  penalty 
being  more  likely  for  black  defendants  than  for  white  defendants.  When  we  add  results 
across  victims’  race  to  get  a  summary  result  for  the  marginal  effect  of  defendant’s  race  on 
the  death  penalty  verdict,  the  larger  circles,  having  the  greater  number  of  cases,  have  greater 
influence.  Thus,  the  summary  proportions  for  each  defendant’s  race,  marked  on  the  figure 
by  periods,  fall  closer  to  the  center  of  the  larger  circles  than  to  the  center  of  the  smaller 
circles.  A  line  connecting  the  summary  marginal  proportions  has  negative  slope,  indicating 
that  overall  the  death  penalty  was  more  likely  for  white  than  for  black  defendants. 

The  result  that  a  marginal  association  can  have  a  different  direction  from  each  conditional 
association  is  called  Simpson’s  paradox  (Simpson  1951),  although  it  was  noted  as  early 
as in Yule (1903). It applies to quantitative as well as categorical variables. Statisticians
commonly  use  it  to  caution  against  imputing  causal  effects  from  an  association  of  X  with 
Y.  For  instance,  when  doctors  started  to  observe  association  between  smoking  and  lung 
cancer,  statisticians  such  as  R.  A.  Fisher  warned  that  some  variable  (e.g.,  a  genetic  factor) 
could  exist  such  that  the  association  would  disappear  under  the  relevant  control.  However, 
others  (e.g.,  J.  Cornfield  in  1954,  as  summarized  by  Greenhouse  2009)  showed  that  at  least 
as  strong  an  association  must  exist  between  a  confounding  variable  Z  and  both  X  and  Y  in 
order  for  the  effect  of  X  on  Y  to  disappear  or  change  under  the  control.  See  Breslow  and 
Day  (1980,  Sec.  3.4)  and  Brass  (1967)  for  related  comments. 

2.3.3  Conditional  and  Marginal  Odds  Ratios 

Odds ratios can describe marginal and conditional associations. We illustrate for 2 × 2 × K
tables, where K denotes the number of categories of a control variable, Z. Let $\{\mu_{ijk}\}$ denote
cell expected frequencies for some sampling model, such as binomial, multinomial, or
Poisson sampling.

Within a fixed category k of Z, the odds ratio

$$\theta_{XY(k)} = \frac{\mu_{11k}\,\mu_{22k}}{\mu_{12k}\,\mu_{21k}} \tag{2.7}$$

describes conditional XY association in partial table k. The conditional odds ratios for the
K partial tables can be quite different from the marginal odds ratio. The XY marginal table
has expected frequencies $\{\mu_{ij+} = \sum_k \mu_{ijk}\}$. The XY marginal odds ratio is

$$\theta_{XY} = \frac{\mu_{11+}\,\mu_{22+}}{\mu_{12+}\,\mu_{21+}}.$$

Sample values of $\theta_{XY(k)}$ and $\theta_{XY}$ use similar formulas with cell counts substituted for
expected frequencies. We illustrate for the association between defendant’s race and the
death penalty in Table 2.6. In the first partial table, victims’ race is white and

$$\hat\theta_{XY(1)} = \frac{53 \times 37}{414 \times 11} = 0.43.$$

The sample odds for white defendants receiving the death penalty were 43% of the sample
odds for black defendants. In the second partial table, victims’ race is black and the
estimated odds ratio equals $\hat\theta_{XY(2)} = (0 \times 139)/(16 \times 4) = 0.0$, since the death penalty
was never given to white defendants with black victims.

Estimation  of  the  marginal  odds  ratio  uses  the  2  x  2  marginal  table  within  Table  2.6, 
collapsing  over  victims’  race,  or  (53  x  176)/(430  x  15)  =  1.45.  The  sample  odds  of  the 
death  penalty  were  45%  higher  for  white  defendants  than  for  black  defendants.  Yet  within 
each  victims’  race  category,  those  odds  were  smaller  for  white  defendants.  This  reversal  in 
the  association  after  controlling  for  victims’  race  illustrates  Simpson’s  paradox. 
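
The conditional and marginal sample odds ratios for Table 2.6 can be computed as in the sketch below. The helper `odds_ratio` and the table layout (rows = white and black defendants, columns = death penalty yes and no) are our own conventions for illustration.

```python
def odds_ratio(table):
    """Cross-product ratio of a 2 x 2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


white_victims = [[53, 414], [11, 37]]   # partial table 1
black_victims = [[0, 16], [4, 139]]     # partial table 2

# Marginal table: cell-by-cell sum of the two partial tables.
marginal = [[a + b for a, b in zip(row1, row2)]
            for row1, row2 in zip(white_victims, black_victims)]

theta_1 = odds_ratio(white_victims)
theta_2 = odds_ratio(black_victims)
theta_marginal = odds_ratio(marginal)

print(f"conditional: {theta_1:.2f}, {theta_2:.2f}   marginal: {theta_marginal:.2f}")
```

Both conditional values fall below 1 while the marginal value exceeds 1, which is the reversal described above.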


2.3.4  Marginal  Independence  Versus  Conditional  Independence 

More generally, when X and Y may have multiple categories, an I × J × K table describes
the relationship between X and Y, controlling for Z. If X and Y are independent in partial
table k, then X and Y are said to be conditionally independent at level k of Z. When Y is a
response, this means that

$$P(Y = j \mid X = i, Z = k) = P(Y = j \mid Z = k), \quad \text{for all } i, j. \tag{2.8}$$

More generally, X and Y are said to be conditionally independent given Z when they are
conditionally independent at every level of Z, that is, when (2.8) holds for all k. Then, given
Z, Y does not depend on X.

Suppose that a single multinomial applies to the entire three-way table, with joint
probabilities $\{\pi_{ijk} = P(X = i, Y = j, Z = k)\}$. Then

$$\pi_{ijk} = P(X = i, Z = k)\, P(Y = j \mid X = i, Z = k).$$

Under conditional independence of X and Y, given Z, this equals

$$\pi_{i+k}\, P(Y = j \mid Z = k) = \pi_{i+k}\, P(Y = j, Z = k)/P(Z = k).$$

Thus, conditional independence is then equivalent to

$$\pi_{ijk} = \pi_{i+k}\,\pi_{+jk}/\pi_{++k} \quad \text{for all } i, j, \text{ and } k. \tag{2.9}$$

Conditional independence does not imply marginal independence (Yule 1903). For
instance, summing (2.9) over k on both sides yields

$$\pi_{ij+} = \sum_k \left(\pi_{i+k}\,\pi_{+jk}/\pi_{++k}\right).$$

All three terms in the summation involve k, and this does not simplify to $\pi_{ij+} = \pi_{i++}\,\pi_{+j+}$,
which is marginal independence.
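
A small numerical sketch may help here. It builds a hypothetical 2 × 2 × 2 joint distribution (all probabilities below are our own illustrative choices) in which X and Y are conditionally independent given Z by construction, then checks that (2.9) holds while marginal independence fails.

```python
import math
from itertools import product

# Hypothetical ingredients: P(Z = k), P(X = i | Z = k), P(Y = j | Z = k).
p_z = [0.5, 0.5]
p_x_given_z = [[0.7, 0.3], [0.3, 0.7]]   # indexed [k][i]
p_y_given_z = [[0.8, 0.2], [0.2, 0.8]]   # indexed [k][j]

# Joint probabilities built so that X and Y are conditionally independent given Z.
pi = {(i, j, k): p_z[k] * p_x_given_z[k][i] * p_y_given_z[k][j]
      for i, j, k in product(range(2), repeat=3)}


def margin(i=None, j=None, k=None):
    """Sum pi over every index left as None (e.g. margin(i=0, k=1) is pi_{1+2})."""
    return sum(p for (a, b, c), p in pi.items()
               if (i is None or a == i) and (j is None or b == j) and (k is None or c == k))


holds_2_9 = all(
    math.isclose(pi[i, j, k], margin(i=i, k=k) * margin(j=j, k=k) / margin(k=k))
    for i, j, k in pi)
marginally_independent = all(
    math.isclose(margin(i=i, j=j), margin(i=i) * margin(j=j))
    for i, j in product(range(2), repeat=2))

print("(2.9) holds:", holds_2_9)
print("marginal independence holds:", marginally_independent)
```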

For 2 × 2 × K tables, X and Y are conditionally independent when the odds ratio
between X and Y equals 1.0 at each category of Z. The expected frequencies $\{\mu_{ijk}\}$ in
Table 2.7 illustrate this relation for Y = response (success, failure), X = drug treatment
(A, B), and Z = clinic (1, 2). From (2.7), the conditional XY odds ratios are

$$\theta_{XY(1)} = \frac{18 \times 8}{12 \times 12} = 1.0, \qquad \theta_{XY(2)} = \frac{2 \times 32}{8 \times 8} = 1.0.$$

Given the clinic, response and treatment are conditionally independent. The marginal table
combines the tables for the two clinics. Its odds ratio is $\theta_{XY} = (20 \times 40)/(20 \times 20) = 2.0$,
so the variables are not marginally independent.

Ignoring the clinic, why are the odds of a success for treatment A twice those for
treatment B? The conditional XZ and YZ odds ratios give a clue. The odds ratio between Z
and either X or Y, at each fixed category of the other variable, equals 6.0. For instance, the
XZ odds ratio at the first category of Y equals (18 × 8)/(12 × 2) = 6.0. The conditional
odds (given response) of receiving treatment A at clinic 1 are six times those at clinic 2, and
the conditional odds (given treatment) of success at clinic 1 are six times those at clinic 2.
Clinic 1 tends to use treatment A more often, and clinic 1 also tends to have more successes.
For instance, if patients at clinic 1 tended to be younger and in better health than those at
clinic 2, perhaps they had a better success rate regardless of the treatment received.


Table 2.7 Expected Frequencies Showing that Conditional
Independence Does Not Imply Marginal Independence

                           Response
Clinic    Treatment    Success    Failure
1         A               18         12
          B               12          8
2         A                2          8
          B                8         32
Total     A               20         20
          B               20         40

It  is  misleading  to  study  only  the  marginal  table,  concluding  that  successes  are  more  likely 
with  treatment  A.  Subjects  within  a  particular  clinic  are  likely  to  be  more  homogeneous 
than  the  overall  sample,  and  response  is  independent  of  treatment  in  each  clinic. 
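
All of the odds ratios quoted for Table 2.7 can be reproduced with a few lines of Python. In this sketch the array `mu[k][i][j]` holds the expected frequency for clinic k, treatment i (A, B), and response j (success, failure); that indexing and the helper name are our own conventions.

```python
def odds_ratio(table):
    """Cross-product ratio of a 2 x 2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# mu[k][i][j]: clinic k, treatment i (A, B), response j (success, failure).
mu = [[[18, 12], [12, 8]],
      [[2, 8], [8, 32]]]

xy_conditional = [odds_ratio(mu[k]) for k in range(2)]
xy_marginal = odds_ratio([[mu[0][i][j] + mu[1][i][j] for j in range(2)] for i in range(2)])

# XZ odds ratio at each response j: rows = treatment, columns = clinic.
xz_conditional = [odds_ratio([[mu[0][0][j], mu[1][0][j]],
                              [mu[0][1][j], mu[1][1][j]]]) for j in range(2)]
# YZ odds ratio at each treatment i: rows = response, columns = clinic.
yz_conditional = [odds_ratio([[mu[0][i][0], mu[1][i][0]],
                              [mu[0][i][1], mu[1][i][1]]]) for i in range(2)]

print("XY given clinic:", xy_conditional, " XY marginal:", xy_marginal)
print("XZ given response:", xz_conditional, " YZ given treatment:", yz_conditional)
```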

2.3.5  Homogeneous  Association 

A 2 × 2 × K table has homogeneous XY association when

$$\theta_{XY(1)} = \theta_{XY(2)} = \cdots = \theta_{XY(K)}.$$

Then the effect of X on Y is the same at each category of Z. Conditional independence of
X and Y is the special case in which each $\theta_{XY(k)} = 1.0$.

Under  homogeneous  XY  association,  homogeneity  also  holds  for  the  other  associations. 
For  instance,  the  conditional  odds  ratio  between  two  categories  of  X  and  two  categories 
of Z is identical at each category of Y. For the odds ratio, homogeneous association is
a  symmetric  property.  It  applies  to  any  pair  of  variables  viewed  across  the  categories  of 
the  third.  When  it  occurs,  there  is  said  to  be  no  interaction  between  two  variables  in  their 
effects  on  the  other  variable. 
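
The symmetry of homogeneous association can be illustrated with a hypothetical 2 × 2 × 2 table. The counts below are invented so that the XY odds ratio equals 4 in both partial tables; the sketch then checks that the XZ odds ratio is the same at each level of Y and the YZ odds ratio is the same at each level of X.

```python
import math


def odds_ratio(table):
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts n[k][i][j]: level k of Z, row i of X, column j of Y.
n = [[[20, 10], [10, 20]],
     [[40, 5], [10, 5]]]

xy = [odds_ratio(n[k]) for k in range(2)]
xz = [odds_ratio([[n[0][0][j], n[1][0][j]],
                  [n[0][1][j], n[1][1][j]]]) for j in range(2)]
yz = [odds_ratio([[n[0][i][0], n[1][i][0]],
                  [n[0][i][1], n[1][i][1]]]) for i in range(2)]

print("XY at each Z:", xy)
print("XZ at each Y:", xz)
print("YZ at each X:", yz)
```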

When interaction exists, the conditional odds ratio for any pair of variables changes
across categories of the third. For X = smoking (yes, no), Y = lung cancer (yes, no), and
Z = age (<45, 45-65, >65), suppose that $\theta_{XY(1)} = 1.2$, $\theta_{XY(2)} = 3.9$, and $\theta_{XY(3)} = 8.8$.
Then smoking has a weak effect on lung cancer for young people, but the effect strengthens
considerably with age. Age is called an effect modifier; the effect of smoking is modified
depending on the value of age.

For the death penalty data (Table 2.6), $\hat\theta_{XY(1)} = 0.43$ and $\hat\theta_{XY(2)} = 0.0$. The values are
not  close,  but  the  second  estimate  is  imprecise  because  of  the  zero  cell  count.  Because 
of  the  ordinary  variation  that  occurs  from  sampling  variability,  these  partial  tables  do  not 
necessarily  contradict  homogeneous  association  in  a  population. 

Some  analyses  of  categorical  data  assume  homogeneous  association,  and  we'll  also 
see  how  to  test  such  an  assumption.  For  example,  when  each  2x2  table  results  from  a 
particular  study,  the  statistical  analysis  may  combine  information  from  the  various  studies  to 
summarize  the  overall  evidence  against  conditional  independence  and  to  assess  whether  the 
effect  was  the  same  in  each  study.  Such  an  analysis  is  called  a  meta-analysis.  In  Section  6.4 
we  show  how  to  analyze  whether  sample  data  are  consistent  with  homogeneous  association 
or  conditional  independence. 

2.3.6  Collapsibility:  Identical  Conditional  and  Marginal  Associations 

Even when conditional associations are identical, we've seen that they may differ from a
marginal association. When do they not differ? We'll study this in some detail in Section
10.1, but for now we’ll state two basic results, for 2 × 2 × K tables stratifying by categories
of Z:

Collapsibility of Odds Ratios. When $\theta_{XY(k)}$ is identical at every level k of Z, that value equals
the marginal odds ratio $\theta_{XY}$ if either Z and X are conditionally independent or if Z and Y are
conditionally independent.



Collapsibility of Difference of Proportions (or Relative Risk). When $\pi_1 - \pi_2$ (or $\pi_1/\pi_2$) is
the same at every level of Z, that value equals the corresponding marginal measure if Z is
independent of X in the marginal XZ table or if Z is conditionally independent of Y given X.

The  conditions  for  odds  ratio  collapsibility  state  that  the  variable  treated  as  the  control 
(Z) is conditionally independent of X or Y, or both. For example, the conditional odds ratio
between  defendant’s  race  and  the  death  penalty  verdict  is  collapsible  over  victim’s  race  if 
(1)  for  each  death  penalty  outcome,  victim’s  race  and  defendant’s  race  are  independent, 
or  (2)  for  each  defendant’s  race,  the  chance  of  the  death  penalty  is  the  same  when  the 
victim  was  white  as  when  the  victim  was  black.  The  first  condition  for  collapsibility  of  the 
difference  of  proportions  or  relative  risk  is  satisfied,  for  example,  for  factorial  designs  with 
the  same  number  of  observations  at  each  combination  of  levels  of  X  and  Z.  For  details  and 
extensions,  see  the  references  in  Note  2.3. 
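
As a hedged illustration of these collapsibility results, the sketch below uses a hypothetical 2 × 2 × 2 table (our own counts) constructed so that Z is conditionally independent of Y given X. Under that condition the conditional odds ratio, difference of proportions, and relative risk should each equal their marginal counterparts.

```python
import math


def odds_ratio(t):
    (a, b), (c, d) = t
    return (a * d) / (b * c)


def difference(t):
    (a, b), (c, d) = t
    return a / (a + b) - c / (c + d)


def relative_risk(t):
    (a, b), (c, d) = t
    return (a / (a + b)) / (c / (c + d))


# Hypothetical partial tables (rows = X, columns = Y), one per level of Z.
# Within each row of X the success proportion is the same at both levels of Z
# (0.5 in row 1, 0.2 in row 2), so Z is conditionally independent of Y given X.
partial = [[[20, 20], [2, 8]],
           [[10, 10], [10, 40]]]
marginal = [[partial[0][i][j] + partial[1][i][j] for j in range(2)] for i in range(2)]

for name, measure in [("odds ratio", odds_ratio),
                      ("difference", difference),
                      ("relative risk", relative_risk)]:
    values = [measure(t) for t in partial]
    print(f"{name:13s} conditional={values}  marginal={measure(marginal)}")
```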


2.4  MEASURING  ASSOCIATION  IN  I × J  TABLES

For 2 × 2 tables, a single number such as the odds ratio can summarize the association.
For I × J tables, it is usually not possible to summarize association by a single number
without some loss of information. However, a set of odds ratios or another summary index
can describe certain features of the association.


2.4.1  Odds  Ratios  in  I × J  Tables

Odds ratios can use each of the $\binom{I}{2}$ pairs of rows in combination with each of the $\binom{J}{2}$
pairs of columns. For rows a and b and columns c and d, the odds ratio $(\pi_{ac}\pi_{bd})/(\pi_{bc}\pi_{ad})$
uses four cells in a rectangular pattern. There are $\binom{I}{2}\binom{J}{2}$ odds ratios of this type. This set
of odds ratios contains much redundant information.

Consider the subset of $(I - 1)(J - 1)$ local odds ratios

$$\theta_{ij} = \frac{\pi_{ij}\,\pi_{i+1,j+1}}{\pi_{i,j+1}\,\pi_{i+1,j}}, \quad i = 1, \ldots, I - 1, \quad j = 1, \ldots, J - 1. \tag{2.10}$$

Figure  2.3  shows  that  local  odds  ratios  use  cells  in  adjacent  rows  and  adjacent  columns. 
These $(I - 1)(J - 1)$ odds ratios determine all odds ratios formed from pairs of rows and
pairs of columns. To illustrate, in Table 2.1, the sample local odds ratio is 2.08 for the first
two  columns  and  1.74  for  the  second  and  third  columns.  In  each  case,  the  more  serious 
outcome  was  more  prevalent  for  the  placebo  group.  The  product  of  these  two  odds  ratios  is 
3.63,  which  is  the  odds  ratio  for  the  first  and  third  columns. 
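
A minimal sketch of formula (2.10) applied to counts follows. Table 2.1 is not reproduced in this section, so the split of heart attacks into fatal and nonfatal used below (18 and 171 for placebo, 5 and 99 for aspirin) is an assumption on our part; it is consistent with the totals of 189 and 104 in Section 2.2.5 and reproduces the values quoted above.

```python
def local_odds_ratios(table):
    """Return the (I-1) x (J-1) local odds ratios of an I x J table with positive cells."""
    I, J = len(table), len(table[0])
    return [[table[i][j] * table[i + 1][j + 1] / (table[i][j + 1] * table[i + 1][j])
             for j in range(J - 1)]
            for i in range(I - 1)]


# Assumed counts: columns = fatal attack, nonfatal attack, no attack.
table_2_1 = [[18, 171, 10845],    # placebo
             [5, 99, 10933]]      # aspirin

local = local_odds_ratios(table_2_1)
first_vs_third = (table_2_1[0][0] * table_2_1[1][2]) / (table_2_1[0][2] * table_2_1[1][0])

print([round(x, 2) for x in local[0]])
print(round(local[0][0] * local[0][1], 2), round(first_vs_third, 2))
```

The product of the two adjacent-column odds ratios equals the odds ratio for the first and third columns, which is one instance of the local odds ratios determining all the others.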

Construction (2.10) for a minimal set of odds ratios is not unique. Another basic set is

$$\frac{\pi_{ij}\,\pi_{IJ}}{\pi_{Ij}\,\pi_{iJ}}, \quad i = 1, \ldots, I - 1, \quad j = 1, \ldots, J - 1. \tag{2.11}$$

This  uses  the  rectangular  pattern  of  cells  determined  by  the  cell  in  row  i  and  column  j  and 
the  cell  in  the  last  row  and  last  column.  Figure  2.3  illustrates. 

Given the marginal distributions $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, when $\{\pi_{ij} > 0\}$, conversion of the
probabilities into the set of odds ratios (2.10) or (2.11) does not discard information.
The cell probabilities determine the odds ratios, and given the marginals, the odds ratios
determine the cell probabilities. In this sense, $(I - 1)(J - 1)$ parameters can describe any
association in an I × J table. Independence is equivalent to all $(I - 1)(J - 1)$ odds ratios
equaling 1.0.
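
One way to see that the odds ratios and the marginals together recover the cell probabilities is sketched below. The text does not prescribe an algorithm; here we assume an iterative proportional rescaling of rows and columns, which relies on the fact that multiplying a whole row or column by a constant leaves every odds ratio unchanged. The joint distribution, function names, and number of sweeps are illustrative choices.

```python
def local_odds_ratios(table):
    """(I-1) x (J-1) local odds ratios of an I x J table with positive cells."""
    I, J = len(table), len(table[0])
    return [[table[i][j] * table[i + 1][j + 1] / (table[i][j + 1] * table[i + 1][j])
             for j in range(J - 1)] for i in range(I - 1)]


def rebuild_table(theta, row_marg, col_marg, sweeps=500):
    """Recover cell probabilities from local odds ratios and both sets of marginals."""
    I, J = len(row_marg), len(col_marg)
    # Seed table: ones in the first row and column, remaining cells filled so that
    # every local odds ratio equals the requested value.
    t = [[1.0] * J for _ in range(I)]
    for i in range(1, I):
        for j in range(1, J):
            t[i][j] = theta[i - 1][j - 1] * t[i - 1][j] * t[i][j - 1] / t[i - 1][j - 1]
    # Alternate row and column rescalings until the marginals match.
    for _ in range(sweeps):
        for i in range(I):
            s = sum(t[i])
            t[i] = [x * row_marg[i] / s for x in t[i]]
        for j in range(J):
            s = sum(t[i][j] for i in range(I))
            for i in range(I):
                t[i][j] *= col_marg[j] / s
    return t


pi = [[0.10, 0.20, 0.10],
      [0.05, 0.15, 0.40]]                 # hypothetical joint probabilities
row_marg = [sum(row) for row in pi]
col_marg = [sum(row[j] for row in pi) for j in range(3)]

rebuilt = rebuild_table(local_odds_ratios(pi), row_marg, col_marg)
max_error = max(abs(rebuilt[i][j] - pi[i][j]) for i in range(2) for j in range(3))
print(f"largest absolute difference from the original table: {max_error:.2e}")
```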

For three-way I × J × K tables, sets of odds ratios in the partial tables describe the
conditional  association.  Homogeneous  XY  association  means  that  a  conditional  odds  ratio 
formed  using  any  particular  two  categories  of  X  and  any  particular  two  categories  of  Y  is 
the  same  at  each  category  of  Z. 


2.4.2  Association  Factors 

An  alternative  type  of  association  summary  focuses  on  individual  cells  and  whether  a  cell 
has  more  or  fewer  subjects  than  we’d  expect  if  the  variables  are  independent.  One  way  to 
do this uses the $IJ$ association factors (Good 1956),

\[
\pi_{ij}/(\pi_{i+}\,\pi_{+j}).
\]

An association factor is the ratio of the cell probability to the probability corresponding
to independence for the particular marginal distributions. It falls between 0 and
$\min(1/\pi_{i+},\, 1/\pi_{+j})$, with the baseline value of 1 corresponding to independence.

It can be informative to investigate which cells have probabilities substantially different
from independence. For instance, we could regard the departure from independence in a
cell as being noteworthy when the association factor is larger than 2 or smaller than 1/2.
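A sample version replaces the probabilities by proportions, giving $n\,n_{ij}/(n_{i+}n_{+j})$. The sketch below computes these for the counts of Table 2.8 (taken from later in this section as example data). The cutoffs 2 and 1/2 used for flagging cells are an illustrative assumption, not a rule.

```python
from fractions import Fraction

# Counts from Table 2.8 (rows: age group; columns: job satisfaction 1, 2, 3).
table = [
    [34, 53, 88],
    [80, 174, 304],
    [29, 75, 172],
]

n = sum(map(sum, table))
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Sample association factor: p_ij / (p_i+ * p_+j) = n * n_ij / (n_i+ * n_+j).
factors = [
    [Fraction(n * table[i][j], row_tot[i] * col_tot[j]) for j in range(len(col_tot))]
    for i in range(len(row_tot))
]

# Cells whose factor is above 2 or below 1/2 (illustrative cutoffs).
flagged = [
    (i, j)
    for i, row in enumerate(factors)
    for j, f in enumerate(row)
    if f > 2 or f < Fraction(1, 2)
]
```

For these data no cell is flagged, which is consistent with the weak association reported for Table 2.8.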


2.4.3  Summary  Measures  of  Association 

Another  way  to  describe  association  uses  a  single  summary  index.  We  discuss  this  first  for 
nominal  variables  and  then  ordinal  variables.  The  most  interpretable  indices  for  nominal 
variables have the same structure as R-squared for interval variables. It and the more general
intraclass  correlation  coefficient  and  correlation  ratio  (Kendall  and  Stuart  1979)  describe 
the  proportional  reduction  in  variance  from  the  marginal  distribution  of  the  response  Y  to 
the  conditional  distributions  of  Y  given  an  explanatory  variable  X. 

Let $V(Y)$ denote a measure of variation for the marginal distribution $\{\pi_{+j}\}$ of $Y$, and let
$V(Y \mid i)$ denote this measure computed for the conditional distribution $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$ of $Y$
at the $i$th setting of $X$. A proportional reduction in variation measure has the form

\[
\frac{V(Y) - E[V(Y \mid X)]}{V(Y)}, \tag{2.12}
\]

where $E[V(Y \mid X)]$ is the expectation of the conditional variation taken with respect to the
distribution of $X$. For the marginal distribution $\{\pi_{i+}\}$ of $X$, $E[V(Y \mid X)] = \sum_i \pi_{i+} V(Y \mid i)$.

For a nominal response, Theil (1970) proposed an index using the variation measure
$V(Y) = -\sum_j \pi_{+j} \log \pi_{+j}$, called the entropy. For contingency tables, the proportional
reduction in entropy equals

\[
U = -\frac{\sum_i \sum_j \pi_{ij} \log(\pi_{ij}/\pi_{i+}\pi_{+j})}{\sum_j \pi_{+j} \log \pi_{+j}}, \tag{2.13}
\]

called the uncertainty coefficient. It takes values between 0 and 1: $U = 0$ is equivalent to
independence of $X$ and $Y$; $U = 1$ is equivalent to a lack of conditional variation, in the
sense that for each $i$, $\pi_{j|i} = 1$ for some $j$.

Various measures of form (2.12) describe association in $I \times J$ tables (see Exercises 2.39
and  2.40).  A  difficulty  with  them  is  developing  intuition  for  how  large  a  value  constitutes 
a  strong  association.  How  do  we  interpret,  say,  a  30%  reduction  in  entropy?  Summary 
measures  seem  easier  to  interpret  and  more  useful  when  both  classifications  are  ordinal,  as 
discussed  next. 


2.4.4  Ordinal  Trends:  Concordant  and  Discordant  Pairs 

Table  2.8  cross-classifies  job  satisfaction  with  age  for  a  recent  General  Social  Survey 
(GSS).  The  GSS  is  a  probability  sample  of  Americans  conducted  every  other  year.  Both 
classifications  are  ordinal  as  measured,  with  the  job  satisfaction  categories  being  1  =  not 
satisfied,  2  =  fairly  satisfied,  3  =  very  or  completely  satisfied. 

When X and Y are ordinal, a monotone trend association is common. For instance,
perhaps job satisfaction tends to increase as age does. Measures that describe the degree
to which a relationship is monotone can be based on classifying each pair of subjects as
concordant or discordant. A pair is concordant if the subject ranked higher on X also ranks
higher on Y. The pair is discordant if the subject ranking higher on X ranks lower on Y.

Table 2.8  Cross-Classification of Job Satisfaction by Age of Respondent

              Job Satisfaction
Age           1       2       3
<30          34      53      88
30-50        80     174     304
>50          29      75     172

Source: 2006 General Social Survey, National Opinion Research Center.

For Table 2.8, consider a pair of subjects, one in the cell (<30, 1) and the other in the
cell (30-50, 2). This pair is concordant, since the second subject ranks higher than the
first both on age and on job satisfaction. All 34 subjects in cell (<30, 1) form concordant
pairs when matched with each of the 174 subjects classified (30-50, 2), so these two cells
provide 34 × 174 = 5916 concordant pairs. Each subject in the cell (<30, 1) is also part of a
concordant pair when matched with each of the other (304 + 75 + 172) subjects ranked higher
on both variables. Similarly, the 53 subjects in the (<30, 2) cell are part of concordant
pairs when matched with the (304 + 172) subjects ranked higher on both variables. The total
number of concordant pairs, denoted by C, equals

\[
C = 34(174 + 304 + 75 + 172) + 53(304 + 172) + 80(75 + 172) + 174(172) = 99{,}566.
\]

The total number of discordant pairs of observations is

\[
D = 88(80 + 174 + 29 + 75) + 53(80 + 29) + 304(29 + 75) + 174(29) = 73{,}943.
\]


In  this  example,  C  >  D,  suggesting  a  tendency  for  higher  age  to  be  associated  with  higher 
job  satisfaction. 
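The counting rule generalizes to any two-way table of counts with ordered rows and columns. The following sketch (the function name is illustrative) pairs each cell with the cells in lower rows: those to the right contribute concordant pairs and those to the left contribute discordant pairs. Applied to Table 2.8, it reproduces the totals given above.

```python
def concordant_discordant(table):
    """Return (C, D) for a table of counts with ordered rows and columns."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):
                # Cells in a lower row and to the right: higher on both variables.
                concordant += table[i][j] * sum(table[h][j + 1:])
                # Cells in a lower row and to the left: higher on X, lower on Y.
                discordant += table[i][j] * sum(table[h][:j])
    return concordant, discordant

# Counts from Table 2.8 (rows: age <30, 30-50, >50; columns: job satisfaction 1, 2, 3).
table_2_8 = [
    [34, 53, 88],
    [80, 174, 304],
    [29, 75, 172],
]

C, D = concordant_discordant(table_2_8)
```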

Consider two independent observations from a joint probability distribution $\{\pi_{ij}\}$. For
that pair, the probabilities of concordance and discordance are

\[
\Pi_c = 2 \sum_i \sum_j \pi_{ij} \Bigl( \sum_{h>i} \sum_{k>j} \pi_{hk} \Bigr), \qquad
\Pi_d = 2 \sum_i \sum_j \pi_{ij} \Bigl( \sum_{h>i} \sum_{k<j} \pi_{hk} \Bigr).
\]

Here $i$ and $j$ are fixed in the inner summations, and the factor of 2 occurs because the first
observation could be in cell $(i, j)$ and the second in cell $(h, k)$, or vice versa.
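To see these formulas at work, the sketch below treats the sample proportions of Table 2.8 as if they were a joint distribution (an assumption made only for illustration) and evaluates both probabilities with exact fractions. With proportions $n_{ij}/n$, the results equal $2C/n^2$ and $2D/n^2$, where $C$ and $D$ are the pair counts found earlier; the remaining probability belongs to pairs tied on at least one variable.

```python
from fractions import Fraction

# Counts from Table 2.8, converted to proportions and treated as a joint distribution.
counts = [
    [34, 53, 88],
    [80, 174, 304],
    [29, 75, 172],
]
n = sum(map(sum, counts))
pi = [[Fraction(c, n) for c in row] for row in counts]

def concordance_probabilities(p):
    """Return (Pi_c, Pi_d) for a joint distribution with ordered categories."""
    n_rows, n_cols = len(p), len(p[0])
    conc = disc = Fraction(0)
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):
                conc += p[i][j] * sum(p[h][j + 1:], Fraction(0))  # h > i, k > j
                disc += p[i][j] * sum(p[h][:j], Fraction(0))      # h > i, k < j
    # Factor of 2: either observation may be the one in cell (i, j).
    return 2 * conc, 2 * disc

pi_c, pi_d = concordance_probabilities(pi)
prob_tied = 1 - pi_c - pi_d
```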


2.4.5  Ordinal  Measure  of  Association:  Gamma 

Given that a pair is untied on both variables, $\Pi_c/(\Pi_c + \Pi_d)$ is the probability of concordance
and $\Pi_d/(\Pi_c + \Pi_d)$ is the probability of discordance. The difference between these
probabilities,

\[
\gamma = \frac{\Pi_c - \Pi_d}{\Pi_c + \Pi_d}, \tag{2.14}
\]

is called gamma (Goodman and Kruskal 1954). The sample version is $\hat{\gamma} = (C - D)/(C + D)$.

For Table 2.8, C = 99,566 and D = 73,943. Hence,

\[
\hat{\gamma} = (99{,}566 - 73{,}943)/(99{,}566 + 73{,}943) = 0.148.
\]

Only a weak tendency exists for job satisfaction to increase as age increases. Of the untied
pairs, the proportion of concordant pairs is 0.148 higher than the proportion of discordant
pairs.

Like the correlation, gamma treats the variables symmetrically and it has range $-1 \le \gamma \le 1$.
A reversal in the category orderings of one variable causes a change in the sign
of $\gamma$. Whereas the absolute value of the correlation is 1 when the relationship between X
and Y is perfectly linear, only monotonicity is required for $|\gamma| = 1$, with $\gamma = 1$ if $\Pi_d = 0$
and $\gamma = -1$ if $\Pi_c = 0$. Independence implies that $\gamma = 0$, but the converse is not true. For
instance, a U-shaped joint distribution can have $\Pi_c = \Pi_d$ and hence $\gamma = 0$.
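These properties are easy to verify on small tables. The sketch below computes sample gamma for the counts of Table 2.8 and then for three further cases: the same table with its column order reversed, a hypothetical monotone but nonlinear table, and a hypothetical U-shaped table. The function name and the two hypothetical tables are assumptions made for illustration.

```python
def sample_gamma(table):
    """Sample gamma (C - D) / (C + D) for a table with ordered categories."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):
                concordant += table[i][j] * sum(table[h][j + 1:])
                discordant += table[i][j] * sum(table[h][:j])
    return (concordant - discordant) / (concordant + discordant)

# Counts from Table 2.8 (age by job satisfaction).
table_2_8 = [[34, 53, 88], [80, 174, 304], [29, 75, 172]]
g = sample_gamma(table_2_8)

# Reversing the category order of one variable changes the sign.
g_reversed = sample_gamma([row[::-1] for row in table_2_8])

# Hypothetical monotone table: no discordant pairs, so gamma = 1.
g_monotone = sample_gamma([[10, 5, 0], [0, 5, 10]])

# Hypothetical U-shaped table: clearly dependent, yet C = D and gamma = 0.
g_u_shaped = sample_gamma([[10, 0], [0, 10], [10, 0]])
```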

For continuous variables, samples can be fully ranked; that is, no ties occur. Then,
$C + D = \binom{n}{2}$ and $\hat{\gamma} = (C - D)/\binom{n}{2}$. This is Kendall's tau.
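For untied data the pairs can simply be enumerated. The sketch below uses five hypothetical (x, y) observations, invented for illustration, and classifies each of the $\binom{5}{2} = 10$ pairs by the sign of the product of the differences.

```python
from itertools import combinations
from math import comb

# Hypothetical untied observations (x, y), for illustration only.
data = [(1.2, 3.4), (2.5, 1.1), (3.1, 4.8), (4.7, 2.9), (5.0, 6.3)]

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(data, 2):
    sign = (x1 - x2) * (y1 - y2)
    if sign > 0:
        concordant += 1   # same ordering on x and y
    elif sign < 0:
        discordant += 1   # opposite ordering on x and y

n = len(data)
tau = (concordant - discordant) / comb(n, 2)
```

Because no pair is tied, every pair is either concordant or discordant, so the denominator of gamma is the total number of pairs.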


2.4.6  Probabilistic  Comparisons  of  Two  Ordinal  Distributions 

Now consider the special case of a $2 \times J$ table, for comparing two groups on an ordinal
response variable $Y$. Let $Y_1$ and $Y_2$ denote the column numbers of the response variable for
subjects selected at random from rows 1 and 2, independently of each other. A measure
that summarizes their relative size is

\[
\Delta = P(Y_1 > Y_2) - P(Y_2 > Y_1). \tag{2.15}
\]

Related useful measures are $P(Y_1 > Y_2) + \tfrac{1}{2}P(Y_1 = Y_2)$ (Exercise 2.41) and
$P(Y_1 > Y_2)/P(Y_2 > Y_1)$ (Agresti 2010, Sec. 2.1.4).

If $Y_1$ and $Y_2$ are identically distributed, then $\Delta = 0$. When $\Delta > 0$ ($< 0$), outcomes
of $Y_1$ tend to be larger (smaller) than outcomes of $Y_2$. Let $F_{j|i} = \pi_{1|i} + \cdots + \pi_{j|i}$. When
$F_{j|1} \le F_{j|2}$ for $j = 1, \ldots, J$, the conditional distribution in row 1 is stochastically higher
than the one in row 2. This condition implies that $\Delta \ge 0$.

With sample data in the form of two independent multinomials, we can estimate $\Delta$ by

\[
\hat{\Delta} = \sum\sum_{j>k} p_{j|1}\,p_{k|2} - \sum\sum_{j<k} p_{j|1}\,p_{k|2}.
\]

If we artificially identify row 1 as the higher level of the group variable, then this relates to
the numbers of concordant and discordant pairs by

\[
\hat{\Delta} = (C - D)/(n_1 n_2).
\]

With $J = 2$ the measure simplifies to the difference of proportions.

Table 2.9  Shoulder Tip Pain Scores After Laparoscopic Surgery

                      Pain Scores
Treatments      1     2     3     4     5    Total
Active         19     2     1     0     0      22
Control         7     3     4     3     2      19

Source: T. Lumley, Biometrics 52: 354-361, 1996.


2.4.7  Example:  Comparing  Pain  Ratings  After  Surgery 

Table  2.9  is  from  a  study  to  compare  an  active  treatment  with  a  control  treatment  for 
patients  having  shoulder  tip  pain  after  laparoscopic  surgery.  The  two  treatments  were 
randomly  assigned  to  41  patients.  The  patients  rated  their  pain  level  on  a  scale  from  1  (low) 
to  5  (high)  on  the  fifth  day  after  the  surgery. 

The  sample  conditional  distributions  on  shoulder  tip  pain  are: 

Active:  (0.86,  0.09,  0.05,  0.00,  0.00) 

Control:  (0.37,  0.16,  0.21,  0.16,  0.11). 

The groups are stochastically ordered, with active treatment patients tending to be lower in
their pain rating. For these data,

\[
\hat{\Delta} = \frac{[1(7 + 3) + 2(7)] - [19(3 + 4 + 3 + 2) + 2(4 + 3 + 2) + 1(3 + 2)]}{22 \times 19} = -0.543
\]

estimates the difference between the probability that the pain rating is higher for active than
control treatments and the probability that the pain rating is higher for control than active
treatments.
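The calculation can be reproduced from the counts in Table 2.9. The sketch below (function names are illustrative) counts the pairs in which the active-treatment rating is higher and those in which it is lower, and also compares the cumulative proportions to confirm the stochastic ordering.

```python
# Counts from Table 2.9, pain scores 1 through 5.
active = [19, 2, 1, 0, 0]
control = [7, 3, 4, 3, 2]

def delta_hat(row1, row2):
    """Estimate P(Y1 > Y2) - P(Y2 > Y1) from two rows of counts."""
    n1, n2 = sum(row1), sum(row2)
    cells = [(j, k) for j in range(len(row1)) for k in range(len(row2))]
    higher = sum(row1[j] * row2[k] for j, k in cells if j > k)  # row 1 rated higher
    lower = sum(row1[j] * row2[k] for j, k in cells if j < k)   # row 1 rated lower
    return (higher - lower) / (n1 * n2)

def cumulative_proportions(row):
    """Sample cumulative distribution function across the ordered columns."""
    total, running, out = sum(row), 0, []
    for count in row:
        running += count
        out.append(running / total)
    return out

estimate = delta_hat(active, control)

# Active is stochastically lower if its cumulative proportions are never smaller.
f_active = cumulative_proportions(active)
f_control = cumulative_proportions(control)
active_stochastically_lower = all(a >= c for a, c in zip(f_active, f_control))
```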


2.4.8  Correlation  for  Underlying  Normality 

For  ordinal  variables,  another  approach  to  measuring  association  uses  the  correlation.  In 
simplest  form,  you  merely  assign  fixed  scores  or  midrank  scores  to  the  rows  and  to  the 
columns  and  use  the  ordinary  Pearson  correlation  formula. 

An  alternative  approach,  advocated  by  Karl  Pearson,  estimates  the  correlation  for  a 
bivariate  normal  distribution  assumed  to  underlie  the  contingency  table.  Pearson  (1904) 
applied  this  approach  for  2  x  2  tables,  where  his  tetrachoric  correlation  is  the  ML  estimate 
of  the  correlation  for  the  bivariate  normal.  This  is  the  correlation  value  in  the  bivariate 
normal  density  that  produces  cell  probabilities  equal  to  the  sample  cell  proportions  when 
that  density  is  collapsed  to  a  2  x  2  table  having  the  same  marginal  proportions  as  the 
observed table. This approach was later generalized to a polychoric correlation for $I \times J$
tables  (Tallis  1962). 

As  Section  17.1  discusses,  a  strong  disagreement  arose  between  Pearson  and  others  about 
when  it  was  sensible  to  assume  underlying  normality  for  inherently  categorical  variables. 
Pearson considered approximating underlying normal correlations in various ways. For
example, his contingency coefficient (Pearson 1904, Exercise 3.32) is a function of a
chi-squared statistic for $I \times J$ tables, and his biserial correlation (Pearson 1909) applies
to 2 x c tables with ordered columns.


NOTES 

Section  2.2:  Comparing  Two  Proportions 

2.1  Odds  ratio  invariance:  Breslow  (1996)  reviewed  the  development  of  methods  for  case-control 
studies. For $2 \times 2$ tables, Edwards (1963) showed that functions of the odds ratio are the only
statistics  that  are  invariant  both  to  row-column  interchange  and  to  multiplication  within  rows 
or  within  columns  by  a  constant.  For  I  x  J  tables,  Altham  (1970)  gave  related  results.  Yule 
(1912,  p.  587)  had  argued  that  multiplicative  invariance  is  a  desirable  property  for  measures  of 
association,  especially  when  proportions  sampled  in  various  marginal  categories  are  arbitrary. 
Goodman  (2000)  showed  five  ways  of  viewing  association  in  a  2  x  2  table  and  proposed  a 
general  measure  that  includes  all  five. 


Section 2.3: Conditional Association in Stratified $2 \times 2$ Tables

2.2  Simpson’s  paradox:  Paik  (1985)  proposed  circle  diagrams  of  type  Figure  2.2  to  summarize 
three-way  tables.  For  more  on  Simpson’s  paradox  and  when  it  can  happen,  see  Blyth  (1972), 
Davis  (1989),  Dong  (2005),  Greenland  et  al.  (1999),  Pavlides  and  Perlman  (2009),  Samuels 
(1993), and Simpson (1951). Good and Mittal (1987) extended it to an amalgamation paradox,
whereby  a  marginal  measure  is  greater  than  the  maximum  or  less  than  the  minimum  of  the 
partial  table  measures. 

2.3 Collapsibility: For $I \times J \times 2$ tables, the odds ratio collapsibility conditions in Section 2.3.6
are necessary as well as sufficient (Simpson 1951, Whittemore 1978). For $I \times J \times K$ tables,
Ducharme and Lepage (1986) showed the conditions are necessary and sufficient for the odds
ratios to remain the same no matter how the levels of Z are pooled (i.e., no matter how Z is
partially collapsed). For collapsibility for the difference of proportions and relative risk, see
Geng (1992), Shapiro (1982), and Wermuth (1987).


Section 2.4: Measuring Association in $I \times J$ Tables

2.4 Surveys: Goodman and Kruskal (1954, 1959) surveyed the historical development of measures
of  association  and  introduced  new  measures.  Agresti  (2010,  Chaps.  2  and  7)  and  Kruskal  (1958) 
surveyed  ordinal  measures  of  association. 


EXERCISES 

Applications 

2.1 According to the FBI website (www.fbi.gov), in 2008, of female murder victims,
1710 were slain by males and 200 by females, whereas of male murder victims, 4351
were slain by males and 455 by females. Let Y denote sex of victim and X denote
sex of offender. Report the sample (a) joint distribution of X and Y, (b) conditional
distribution of Y given X, and (c) conditional distribution of X given Y.

2.2 According to the FBI website, of all blacks slain in 2008, 92% were slain by blacks,
and of all whites slain in 2005, 85% were slain by whites. Let Y denote race of
victim and X denote race of offender.

a. Which conditional distribution do these statistics refer to, Y given X, or X given Y?

b.  Given  that  a  murderer  was  white,  what  additional  information  would  you  need 
to  estimate  the  probability  that  the  victim  was  white?  [Hint:  How  could  you  use 
Bayes’  theorem?] 

c.  Consider  the  previous  exercise.  Which  association  is  stronger — between  sex  of 
victim  and  sex  of  offender,  or  between  race  of  victim  and  race  of  offender?  Justify 
your  answer. 

2.3  An  article  in  The  New  York  Times  (Feb.  17,  1999)  about  the  PSA  blood  test  for 
detecting  prostate  cancer  stated:  “The  test  fails  to  detect  prostate  cancer  in  1  in  4 
men  who  have  the  disease  (false-negative  results),  and  as  many  as  two-thirds  of  the 
men tested receive false-positive results.” Let $C$ ($\bar{C}$) denote the event of having (not
having) prostate cancer, and let $+$ ($-$) denote a positive (negative) test result. Which
is true: $P(- \mid C) = \tfrac{1}{4}$ or $P(C \mid -) = \tfrac{1}{4}$? $P(\bar{C} \mid +) = \tfrac{2}{3}$ or $P(+ \mid \bar{C}) = \tfrac{2}{3}$? Determine the
sensitivity and specificity.

2.4  Table  2.10  shows  fatality  results  for  drivers  and  passengers  in  auto  accidents  in 
Florida  in  2008,  according  to  whether  the  person  was  wearing  a  seat  belt. 

a.  Estimate  the  probability  of  fatality,  conditional  on  seat-belt  use  in  category  (i)  no 
and  (ii)  yes. 

b.  Estimate  the  probability  of  wearing  a  seat  belt,  conditional  on  the  injury  being  (i) 
fatal  and  (ii)  nonfatal. 

c.  For  the  most  natural  choice  of  response  variable,  find  and  interpret  the  difference 
of  proportions,  relative  risk,  and  odds  ratio.  Why  are  the  relative  risk  and  odds 
ratio  approximately  equal? 


Table 2.10  Data for Exercise 2.4 on Auto Accidents

                       Injury
Seat-Belt Use     Fatal     Nonfatal
No                 1085       55,623
Yes                 703      441,239

Source: Florida Department of Highway Safety and Motor Vehicles,
www.flhsmv.gov/hsmvdocs/CS2008.pdf.


2.5  Consider  the  following  two  studies  reported  in  The  New  York  Times. 

a.  A  British  study  reported  (Dec.  3,  1998)  that  of  smokers  who  get  lung  cancer, 
“women  were  1.7  times  more  vulnerable  than  men  to  get  small-cell  lung  cancer.” 
Is 1.7 the odds ratio or the relative risk?

b. A National Cancer Institute study about tamoxifen and breast cancer reported
(Apr.  7,  1998)  that  the  women  taking  the  drug  were  45%  less  likely  to  experience 
invasive  breast  cancer  than  were  women  taking  placebo.  Find  the  relative  risk 
for  (i)  those  taking  the  drug  compared  with  those  taking  placebo,  and  (ii)  those 
taking  placebo  compared  with  those  taking  the  drug. 

2.6  According  to  a  report  by  the  United  Nations  Office  on  Drugs  and  Crime,  the  number 
of  homicides  involving  firearms  per  million  people  is  about  62.4  in  the  United  States, 
6.0  in  Canada,  5.6  in  Australia,  and  1.3  in  the  UK.  Use  the  relative  risk  to  compare 
the  United  States  with  the  other  countries.  For  such  data,  explain  why  the  relative 
risk  is  more  informative  than  the  difference  of  proportions. 

2.7  An  article  in  The  Economist  (July  3,  2010)  stated  that  the  number  of  people  in  prison 
is  154  per  100,000  in  England  and  Wales,  96  per  100,000  in  France,  87  per  100,000 
in Germany, and 753 per 100,000 in the United States. Explain how to use the relative
risk  to  compare  the  U.S.  rate  to  the  others. 

2.8  At  the  start  of  the  2010  World  Cup,  the  betting  exchange  Betfair  stated  that  the  odds 
against being the winning team were 9/2 for Spain, 11/2 for Brazil, 6/1 for England,
and  90/1  for  the  United  States.  Find  the  corresponding  prior  probabilities  of  winning 
for  these  four  teams. 

2.9  In  a  recent  survey  of  people  aged  50-71  in  the  United  States  summarized  by  N. 
Freedman et al. (Lancet Oncol. 9: 649-656, 2008), during a follow-up period the
annual  probability  of  lung  cancer  occurrence  was  about  0.00023  for  people  who  had 
never  smoked  and  about  0.01284  for  current  smokers  who  smoked  more  than  two 
packs  per  day.  Find  and  interpret  the  difference  of  proportions  and  the  relative  risk. 
Which  measure  is  more  informative  for  these  data?  Why? 

2.10  For  adults  who  sailed  on  the  Titanic  on  its  fateful  voyage,  the  odds  ratio  between 
gender  (female,  male)  and  survival  (yes,  no)  was  11.4.  (For  data,  see  R.  J.  M. 
Dawson,  J.  Statist.  Ed.  3,  1995.) 

a.  What  is  wrong  with  the  interpretation,  “The  probability  of  survival  for  females 
was 11.4 times that for males”? Give the correct interpretation. When would the
quoted  interpretation  be  approximately  correct? 

b.  The  odds  of  survival  for  females  equaled  2.9.  For  each  gender,  find  the  proportion 
who  survived. 

2.11  A  research  study  estimated  that  under  a  certain  condition,  the  probability  that  a 
subject  would  be  referred  for  heart  catheterization  was  0.906  for  whites  and  0.847 
for  blacks. 

a. A press release about the study stated that the odds of referral for cardiac catheterization
for blacks are 60% of the odds for whites. Explain how they obtained 60%
(more  accurately,  57%). 

b.  An  Associated  Press  story  later  described  the  study  and  said  “Doctors  were 
only  60%  as  likely  to  order  cardiac  catheterization  for  blacks  as  for  whites.” 
Explain what is wrong with this interpretation. Give the correct percentage for
this  interpretation. 

2.12  A  20-year  cohort  study  of  British  male  physicians  (R.  Doll  and  R.  Peto,  Br.  Med.  J. 
2:  1525-1536,  1976)  noted  that  the  proportion  per  year  who  died  from  lung  cancer 
was  0.00140  for  cigarette  smokers  and  0.00010  for  nonsmokers.  The  proportion 
who  died  from  coronary  heart  disease  was  0.00669  for  smokers  and  0.00413  for 
nonsmokers. 

a.  Describe  the  association  of  smoking  with  each  of  lung  cancer  and  heart  disease, 
using  the  difference  of  proportions,  relative  risk,  and  odds  ratio.  Interpret. 

b.  Which  response  is  more  strongly  related  to  cigarette  smoking,  in  terms  of  the 
reduction  in  number  of  deaths  that  would  occur  with  elimination  of  cigarettes? 
Explain. 

2.13 For the Women’s Health Study, heart attacks were reported for 198 of 19,934 taking
aspirin and for 193 of 19,942 taking placebo (J. Am. Med. Assoc. 295: 306-313,
2006).  Construct  the  2  x  2  table  that  cross-classifies  the  treatment  with  whether  a 
heart  attack  was  reported.  Estimate  the  odds  ratio.  Interpret.  (As  of  2006,  results 
suggested  that,  for  women,  aspirin  was  helpful  for  reducing  risk  of  stroke  but  not 
necessarily  risk  of  heart  attack.) 

2.14 According to poll results released by the Pew Research Center (www.people-press.org)
in 2010, when adults in the United States were asked whether there
is  solid  evidence  that  the  average  temperature  on  earth  has  been  getting  warmer 
over  the  past  few  decades,  the  estimated  odds  of  a  yes  response  for  a  Democrat 
was  2.96  times  higher  than  for  an  Independent,  and  it  was  2.08  times  higher  for  an 
Independent  than  for  a  Republican.  Find  the  estimated  odds  ratio  between  opinion 
on  global  warming  and  whether  one  is  a  Democrat  or  a  Republican.  Interpret. 

2.15 Table 2.11 refers to applicants to graduate school at the University of California at
Berkeley, for fall 1973. It presents admissions decisions by gender of applicant for
the six largest graduate departments. Denote the three variables by A = whether
admitted, G = gender, and D = department. Find the sample AG conditional odds
ratios and the marginal odds ratio. Interpret, and explain why they give such different
indications of the AG association.


Table 2.11  Data for Exercise 2.15 on Graduate Admissions

                       Whether Admitted
                    Male              Female
Department      Yes      No       Yes      No
A               512     313        89      19
B               353     207        17       8
C               120     205       202     391
D               138     279       131     244
E                53     138        94     299
F                22     351        24     317
Total          1198    1493       557    1278

Source: Data from P. Bickel et al., Science 187: 398-403, 1975.

2.16  State  three  “real-world”  variables  X,  Y,  and  Z  for  which  you  expect  a  marginal 
association  between  X  and  Y  but  conditional  independence  controlling  for  Z. 

2.17  Based  on  murder  rates  in  the  United  States,  an  Associated  Press  story  reported  that 
the  probability  that  a  newborn  child  has  of  eventually  being  a  murder  victim  is 
0.0263  for  nonwhite  males,  0.0049  for  white  males,  0.0072  for  nonwhite  females, 
and  0.0023  for  white  females. 

a.  Find  the  conditional  odds  ratios  between  race  and  whether  a  murder  victim,  given 
gender.  Interpret.  Do  these  variables  exhibit  homogeneous  association? 

b.  Half  the  newborns  are  of  each  gender,  for  each  race.  Find  the  marginal  odds  ratio 
between  race  and  whether  a  murder  victim. 

2.18  At  each  age  level,  the  death  rate  is  higher  in  South  Carolina  than  in  Maine,  but 
overall,  the  death  rate  is  higher  in  Maine.  Explain  how  this  could  be  possible.  [For 
data,  see  H.  Wainer,  Chance  12(2):  44,  1999.] 

2.19  A  study  of  the  death  penalty  for  cases  in  Kentucky  between  1976  and  1991  (T.  Keil 
and  G.  Vito,  Am.  J.  Criminal  Justice  20:  17-36,  1995)  indicated  that  the  defendant 
received  the  death  penalty  in  8%  of  the  391  cases  in  which  a  white  killed  a  white,  in 
2%  of  the  108  cases  in  which  a  black  killed  a  black,  in  12%  of  the  57  cases  in  which 
a  black  killed  a  white,  and  in  0%  of  the  18  cases  in  which  a  white  killed  a  black. 
Form  the  three-way  contingency  table,  obtain  the  conditional  odds  ratios  between 
the  defendant’s  race  and  the  death  penalty  verdict,  interpret  those  associations,  study 
whether  Simpson’s  paradox  occurs,  and  explain  why  the  marginal  association  is  so 
different  from  the  conditional  associations. 

2.20 Table 2.12 is from an early study on the death penalty in Florida. Analyze these data
and  show  that  Simpson’s  paradox  occurs. 


Table 2.12  Data for Exercise 2.20 on the Death Penalty

                                       Death Penalty
Victim’s Race    Defendant’s Race      Yes      No
White            White                  19     132
                 Black                  11      52
Black            White                   0       9
                 Black                   6      97

Source: Reprinted with permission from M. L. Radelet, Am. Sociol. Rev. 46:
918-927, 1981.

Table 2.13  Data for Exercise 2.22 on Sexual Attitudes

                                           Homosexual Sex
                          Always    Almost Always    Wrong Only    Not Wrong
Premarital Sex            Wrong     Wrong            Sometimes     At All
Always Wrong                300          4                4            17
Almost Always Wrong          78         15                3            14
Wrong Only Sometimes        107         16               46            54
Not Wrong At All            234         32               35           336

Source: General Social Survey, 2008.


2.21  Smith  and  Jones  are  baseball  players.  Smith  has  a  higher  batting  average  than  Jones 
in each of K years. Is it possible that for the combined data from the K years, Jones
has  the  higher  batting  average?  Explain,  creating  some  data  with  K  =  2  to  illustrate. 

2.22 Table 2.13 summarizes responses from a General Social Survey about homosexual
sex  and  premarital  sex.  Find  and  interpret  a  measure  of  association. 

2.23 For the data in Table 2.13, the two marginal distributions are dependent rather than
independent samples, but the measure $\Delta$ can still compare those distributions. Find
it,  and  interpret. 

2.24 Table 2.14 cross-classifies job satisfaction by race. Determine whether the groups are
stochastically  ordered,  and  estimate  the  difference  between  the  probability  that  job 
satisfaction  is  higher  for  blacks  than  whites  and  the  probability  that  job  satisfaction 
is  higher  for  whites  than  blacks. 


Table 2.14  Cross-Classification of Job Satisfaction by Race of Respondent

                               Job Satisfaction
                                      Fairly       Very or Completely
Race     Dissatisfied    Neutral     Satisfied     Satisfied
Black         19            13           42              59
White         47            40          215             430

Source: 2006 General Social Survey, National Opinion Research Center.


Theory  and  Methods 

2.25 For a diagnostic test of a certain disease, let $\pi_1$ denote the probability that the
diagnosis is positive given that a subject has the disease, and let $\pi_2$ denote the
probability that the diagnosis is positive given that a subject does not have it. Let $p$
denote the probability that a subject has the disease.

a. More relevant to a patient who has received a positive diagnosis is the probability
that he or she truly has the disease. Given that a diagnosis is positive, show that
the probability that a subject has the disease (called the positive predictive value)
is

\[
\pi_1 p / [\pi_1 p + \pi_2 (1 - p)].
\]

b.  Suppose  that  a  diagnostic  test  for  HIV+  status  has  both  sensitivity  and  specificity 
equal  to  0.95,  and  p  =  0.005.  Find  the  probability  that  a  subject  is  truly  HIV+, 
given  that  the  diagnostic  test  is  positive. 

c.  To  better  understand  the  answer  in  (b),  using  the  probabilities  given  there  either  (i) 
find  the  joint  probabilities  relating  diagnosis  to  actual  disease  status  and  discuss 
their  relative  sizes,  or  (ii)  construct  a  tree  diagram  showing  what  you  would  expect 
to  happen  for  a  typical  sample  of  1000  subjects  (first  branching  from  the  root 
according  to  whether  a  subject  is  truly  HIV+  and  then  branching  according  to  the 
test  result),  showing  that  of  the  subjects  with  a  positive  diagnosis,  the  proportion 
actually  HIV+  agrees  with  the  result  in  (b). 

d.  Discuss  how  the  answer  in  (b)  depends  on  the  prevalence  p.  Illustrate  by  finding 
the answer when p = 0.10 instead of 0.005.

2.26 Show that the odds ratio and relative risk need not be similar when $\pi_i$ is close to 1.0
for  both  groups. 

2.27  Let  D  denote  having  a  certain  disease  and  E  denote  having  exposure  to  a  certain  risk 
factor.  The  attributable  risk  (AR)  is  the  proportion  of  disease  cases  attributable  to 
that  exposure  (see  Benichou  2005). 

a. Let $P(\bar{E}) = 1 - P(E)$. Explain why

\[
\mathrm{AR} = [P(D) - P(D \mid \bar{E})]/P(D).
\]

b. Show that AR relates to the relative risk RR by

\[
\mathrm{AR} = [P(E)(\mathrm{RR} - 1)]/[1 + P(E)(\mathrm{RR} - 1)].
\]

2.28 In comparing new and standard treatments with success probabilities $\pi_1$ and $\pi_2$, the
number needed to treat (NNT) is the number of patients that would need to be treated
with the new treatment instead of the standard in order for one patient to benefit.
Explain why a natural estimate of this is $1/(\hat{\pi}_1 - \hat{\pi}_2)$.

2.29 For a $2 \times 2$ table of counts $\{n_{ij}\}$, show that the odds ratio is invariant to (a) interchanging
rows with columns, and (b) multiplication of cell counts within rows or
within columns by $c \ne 0$. Show that the difference of proportions and the relative
risk  do  not  have  these  properties. 

2.30  For  given  Tt\  and  712,  show  that  the  relative  risk  cannot  be  farther  than  the  odds  ratio 
from  their  independence  value  of  1.0. 
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2.31  Let  7r,j|*  =  P(X  =  i,Y  =  j\Z  =  k).  Explain  why  XY  conditional  independence  is 

Tiij\k  =  TTi+\k  it+j\t  for  all  i  and  j  and  k. 

2.32  For  a  2  x  2  x  2  table,  show  that  homogeneous  association  is  a  symmetric  property, 
by  showing  that  equal  XY  conditional  odds  ratios  is  equivalent  to  equal  YZ  conditional 
odds  ratios. 

2.33 For a 2 × 2 × 2 table, suppose $\theta_{XY(1)} = \theta_{XY(2)} = \theta$. For a possibly confounding
variable Z, let $\theta_C$ denote the common value of $\theta_{(i)YZ}$. Let $\pi_1 = P(Z = 1 \mid X = 1, Y = 2)$
and $\pi_2 = P(Z = 1 \mid X = 2, Y = 2)$.

a. Show (Breslow and Day 1980, p. 96) that

$$\theta_{XY} = \theta\,\frac{\theta_C \pi_1 + (1 - \pi_1)}{\theta_C \pi_2 + (1 - \pi_2)}.$$

b. Verify that either odds ratio collapsibility condition in Section 2.3.6 implies that
the confounding risk ratio $\theta_{XY}/\theta$ equals 1.0.

c. Describe what needs to happen for $\theta_{XY}/\theta$ to be far from 1.0. Illustrate with
particular values of $\theta_C > 1$ and $\pi_1 > \pi_2$. Describe a study in which such values
would be plausible.

2.34 When X and Y are conditionally dependent at each level of Z yet marginally independent,
Z is called a suppressor variable. Specify joint probabilities for a 2 × 2 × 2
table to show that this can happen (a) when there is homogeneous association, and
(b) when the association has opposite direction in the partial tables.

2.35 Show that the $\{\alpha_{ij}\}$ in (2.11) determine all odds ratios formed from pairs of rows
and pairs of columns.

2.36 For I × J contingency tables, explain why the variables are independent
when the $(I - 1)(J - 1)$ differences $\pi_{j|i} - \pi_{j|I} = 0$, $i = 1, \ldots, I - 1$, $j = 1, \ldots, J - 1$.

2.37 Suppose that $\{Y_{ij}\}$ are independent Poisson variates with means $\{\mu_{ij}\}$. Show that
$P(Y_{ij} = n_{ij})$ for all i, j, conditional on $\{Y_{i+} = n_i\}$, satisfy independent multinomial
sampling [i.e., the product of (2.2) for all i] within the rows.

2.38 For 2 × 2 tables, Yule (1900, 1912) introduced

$$Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}},$$

which he labeled Q in honor of the Belgian statistician Quetelet. It is now called
Yule’s Q.

a. Show that for 2 × 2 tables, Goodman and Kruskal’s $\gamma = Q$.

b. Show that Q relates to the odds ratio by $Q = (\theta - 1)/(\theta + 1)$, a monotone transformation
of $\theta$ from the $[0, \infty]$ scale onto the $[-1, +1]$ scale.

2.39 Goodman and Kruskal (1954) proposed an association measure (tau) for nominal
variables based on variation measure

$$V(Y) = \sum_j \pi_{+j}(1 - \pi_{+j}) = 1 - \sum_j \pi_{+j}^2.$$

a. Show that $V(Y)$ is the probability that two independent observations on Y fall
in different categories. Show that $V(Y) = 0$ when $\pi_{+j} = 1$ for some j and $V(Y)$
takes maximum value of $(J - 1)/J$ when $\pi_{+j} = 1/J$ for all j. This index relates
to measures of concentration and diversity proposed for various applications,
such as by Corrado Gini (1914a), who was highly influential in the twentieth
century in the development of descriptive statistics in Italy, and by E. H. Simpson
(1949) who described species diversity (see Exercise 16.13).

b. For the proportional reduction in variation, show that
$E[V(Y \mid X)] = 1 - \sum_i \sum_j \pi_{ij}^2/\pi_{i+}$. [The resulting measure (2.12) is called the concentration
coefficient. Like the uncertainty coefficient U, $\tau = 0$ is equivalent to independence.
Haberman (1982) presented generalized concentration and uncertainty
coefficients.]

2.40 The measure of association lambda for nominal variables (Goodman and Kruskal
1954) has $V(Y) = 1 - \max_j\{\pi_{+j}\}$ and $V(Y \mid i) = 1 - \max_j\{\pi_{j|i}\}$. Interpret lambda
as a proportional reduction in error for predictions which select the response category
that is most likely. Show that independence implies $\lambda = 0$ but that the converse is
not true.

2.41 Show that $\Delta$ in (2.15) relates to $\alpha = P(Y_1 > Y_2) + \tfrac{1}{2}P(Y_1 = Y_2)$ by

$$\alpha = (\Delta + 1)/2, \qquad \Delta = 2\alpha - 1,$$

with $\alpha$ having range [0, 1] and null value $\tfrac{1}{2}$.


CHAPTER  3 


Inference  for  Two-Way 
Contingency  Tables 


In this chapter we introduce inferential methods for contingency tables. Many of these methods
also play a vital role in analyses, presented in later chapters, for which categorical data
need not have contingency table form, such as when some explanatory variables are continuous.
The methods assume a standard sampling scheme for categorical data: Poisson,
multinomial, or independent multinomial (or binomial) sampling.

In Section 3.1 we present confidence intervals for measures of association, such as the
odds ratio and the difference and ratio of proportions. Section 3.2 introduces chi-squared
tests of the hypothesis of independence between two categorical variables and confidence
intervals obtained by inverting more general chi-squared tests. In Section 3.3 we show
how to follow up chi-squared tests using residuals and the partitioning property of chi-squared
to extract components that describe the evidence about the association. For ordinal
variables, in Section 3.4 we present more powerful inference that utilizes the category
orderings. The methods of Sections 3.1 through 3.4 assume large samples. In Section 3.5 we
introduce small-sample methods. In Section 3.6 we present Bayesian methods of inference
for contingency tables.


3.1  CONFIDENCE  INTERVALS  FOR  ASSOCIATION  PARAMETERS 

The  precision  of  estimators  of  association  parameters  is  characterized  by  standard  errors  of 
their  sampling  distributions.  In  this  section  we  present  standard  errors  and  simple  confidence 
intervals,  focusing  on  parameters  for  2  x  2  tables.  We'll  present  alternative  intervals,  based 
on  inverting  score  and  likelihood-ratio  tests,  in  Sections  3.2.5  and  3.2.6. 

3.1.1  Interval  Estimation  of  the  Odds  Ratio 

The sample odds ratio for a 2 × 2 table is $\hat{\theta} = (n_{11} n_{22})/(n_{12} n_{21})$. For a multinomial sample,
the estimator $\hat{\theta}$ has an asymptotic normal distribution around $\theta$. Unless n is very large,
however, its sampling distribution is highly skewed. When $\theta = 1$, for instance, $\hat{\theta}$ cannot
be much smaller than $\theta$ (since $\hat{\theta} \geq 0$), but it could be much larger with nonnegligible
probability. The log transform, having an additive rather than multiplicative structure,
converges more rapidly to normality. An estimated standard error for $\log \hat{\theta}$ is

$$\hat{\sigma}(\log \hat{\theta}) = \left( \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} \right)^{1/2}. \qquad (3.1)$$


We derive this formula in Section 3.1.7.

By the large-sample normality of $\log \hat{\theta}$,

$$\log \hat{\theta} \pm z_{\alpha/2}\,\hat{\sigma}(\log \hat{\theta}) \qquad (3.2)$$

is a Wald confidence interval for $\log \theta$ (Woolf 1955). Exponentiating (taking antilogs of) its
endpoints provides a confidence interval for $\theta$. The actual coverage probability is usually a
bit higher than the nominal level.

If any $n_{ij} = 0$, $\hat{\theta}$ equals 0 or $\infty$ and the Wald interval does not exist. Since such an
outcome has positive probability, the actual expected value and variance of $\hat{\theta}$ and $\log \hat{\theta}$ do
not exist.¹ This is not problematic for confidence intervals formed by inverting the score
test or likelihood-ratio test for $\theta$. For these intervals, when $\hat{\theta} = 0$, 0 is the lower limit and
when $\hat{\theta} = \infty$, $\infty$ is the upper limit. This is sensible for a frequentist approach. This also
happens when we construct a small-sample confidence interval for the odds ratio to be
introduced in Section 16.6.4. Alternatively, but somewhat ad hoc, we can use the Wald
formula (3.2) following some adjustment, such as by replacing $\{n_{ij}\}$ by $\{n_{ij} + 0.5\}$ in the
estimator and standard error. In terms of bias and mean squared error, Gart and Zweifel
(1967) and Haldane (1956) showed that such amended estimators perform well (see also
Exercise 16.8).


3.1.2  Example:  Seat-Belt  Use  and  Traffic  Deaths 

We illustrate inference for the odds ratio with Table 3.1, which shows fatality results for
children under age 18 who were passengers in auto accidents in Florida in 2008, according
to whether the child was wearing a seat belt. The sample odds ratio $\hat{\theta} = 10.83$, and the
standard error (3.1) of $\log \hat{\theta} = 2.383$ is $\hat{\sigma}(\log \hat{\theta}) = 0.242$. A 95% confidence interval for
$\log \theta$ in the population this sample represents is $2.383 \pm 1.96(0.242)$, or (1.908, 2.857).


Table 3.1 Injury Outcome and Seat-Belt Use for Child
Passengers in Automobile Accidents in Florida in 2008

                        Injury Outcome
Seat-Belt Use       Fatal     Nonfatal       Total
No                     54       10,325      10,379
Yes                    25       51,790      51,815

Source: Florida Department of Highway Safety and Motor Vehicles,
www.flhsmv.gov/hsmvdocs/CS2008.pdf.


¹ This is also true for ML estimators of model parameters presented in later chapters.

The corresponding interval for $\theta$ is [exp(1.908), exp(2.857)] or (6.74, 17.42). There is a
very strong association. Even though the overall sample size is extremely large, the estimate
of the true odds ratio is rather imprecise because of the relatively small number of fatalities
(Exercise 3.25).
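
The calculation in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `odds_ratio_wald_ci` and its `add` argument (for the ad hoc 0.5 adjustment of Section 3.1.1) are hypothetical, and the normal quantile comes from the standard library rather than the rounded value 1.96.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_wald_ci(n11, n12, n21, n22, conf=0.95, add=0.0):
    """Wald interval for the odds ratio of a 2x2 table, built on the log scale.

    `add` is an optional constant (e.g., 0.5) added to every cell, the ad hoc
    adjustment for tables that contain a zero count.
    """
    a, b, c, d = (n + add for n in (n11, n12, n21, n22))
    theta_hat = (a * d) / (b * c)                    # sample odds ratio
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)     # standard error of log odds ratio
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)     # z_{alpha/2}
    lo = log(theta_hat) - z * se_log                 # Wald limits for log(theta)
    hi = log(theta_hat) + z * se_log
    return theta_hat, se_log, (exp(lo), exp(hi))

# Table 3.1: rows are seat-belt use (No, Yes), columns are (Fatal, Nonfatal).
theta_hat, se_log, (lower, upper) = odds_ratio_wald_ci(54, 10325, 25, 51790)
print(round(theta_hat, 2), round(se_log, 3), round(lower, 2), round(upper, 2))
```

Calling the function with `add=0.5` gives the amended estimator, which stays finite even when a cell count is zero.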


3.1.3  Interval  Estimation  of  Difference  of  Proportions  and  Relative  Risk 

The  difference  of  proportions  and  the  relative  risk  compare  conditional  distributions  of  a 
response  variable  for  two  groups.  For  these  measures,  we  treat  the  samples  as  independent 
binomials. For group i, $y_i$ has a binomial distribution with sample size $n_i$ and a probability
$\pi_i$ of a “success” outcome.

The sample proportion $\hat{\pi}_i = y_i/n_i$ has expectation $\pi_i$ and variance $\pi_i(1 - \pi_i)/n_i$. Since
$\hat{\pi}_1$ and $\hat{\pi}_2$ are independent, their difference has $E(\hat{\pi}_1 - \hat{\pi}_2) = \pi_1 - \pi_2$ and standard error

$$\sigma(\hat{\pi}_1 - \hat{\pi}_2) = \left[ \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2} \right]^{1/2}. \qquad (3.3)$$

The estimate $\hat{\sigma}(\hat{\pi}_1 - \hat{\pi}_2)$ replaces $\pi_i$ by $\hat{\pi}_i$. Then

$$(\hat{\pi}_1 - \hat{\pi}_2) \pm z_{\alpha/2}\,\hat{\sigma}(\hat{\pi}_1 - \hat{\pi}_2) \qquad (3.4)$$

is a Wald confidence interval for $\pi_1 - \pi_2$. Like the Wald interval (1.13) for a single
proportion, it usually has true coverage probability less than the nominal confidence level,
especially when $\pi_1$ and $\pi_2$ are near 0 or 1. Section 3.2.5, Note 3.1, and Exercise 3.27 present
other methods.

The sample relative risk is $r = \hat{\pi}_1/\hat{\pi}_2 = (y_1/n_1)/(y_2/n_2)$. Like the odds ratio, it converges
to normality faster on the log scale. An estimated standard error for $\log r$ is

$$\hat{\sigma}(\log r) = \left( \frac{1 - \hat{\pi}_1}{y_1} + \frac{1 - \hat{\pi}_2}{y_2} \right)^{1/2}. \qquad (3.5)$$

The Wald interval exponentiates endpoints of $\log r \pm z_{\alpha/2}\,\hat{\sigma}(\log r)$. It tends to be somewhat
conservative.


3.1.4  Example:  Aspirin  and  Heart  Attacks  Revisited 

We consider again Table 2.1 from the Harvard study on aspirin use and heart attacks.
The proportions having fatal heart attacks were 18/11,034 = 0.00163 for those taking
placebo and 5/11,037 = 0.00045 for those taking aspirin. The sample relative risk is
0.00163/0.00045 = 3.60. The 95% confidence interval for the log relative risk, using
$\hat{\sigma}(\log r) = 0.505$, is $\log(3.60) \pm 1.96(0.505)$. This translates to (1.34, 9.70) for the relative
risk. We infer that the death rate for those taking placebo was between 1.34 and 9.70 times
that for those taking aspirin. Substantial public health benefits could result from taking
aspirin, but the estimated effect is imprecise despite the very large sample sizes because of
the very low rate of heart attack deaths over the study period, regardless of treatment.

The Wald 95% confidence interval for $\pi_1 - \pi_2$ is $0.0012 \pm 1.96(0.00043)$ or
(0.0003, 0.0020). The relative risk is more useful than $\pi_1 - \pi_2$ for these data, because
the rates of heart attack death were both very low but with ratio quite far from 1.0.
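
As a rough check on these numbers, the following Python sketch computes both Wald intervals from the counts quoted above. The function name `two_proportion_wald` and the dictionary keys are hypothetical. The standard error used for $\log r$ is assumed to be the usual expression $[(1 - \hat{\pi}_1)/y_1 + (1 - \hat{\pi}_2)/y_2]^{1/2}$, which reproduces the value 0.505 quoted in the example.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def two_proportion_wald(y1, n1, y2, n2, conf=0.95):
    """Wald intervals for pi1 - pi2 and for the relative risk pi1 / pi2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2
    # Estimated standard error of the difference of independent proportions.
    se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rr = p1 / p2
    # Assumed standard error of log relative risk (see the lead-in).
    se_log_rr = sqrt((1 - p1) / y1 + (1 - p2) / y2)
    return {
        "diff": diff,
        "diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "rr": rr,
        "se_log_rr": se_log_rr,
        "rr_ci": (exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)),
    }

# Fatal heart attacks: 18 of 11,034 on placebo, 5 of 11,037 on aspirin.
res = two_proportion_wald(18, 11034, 5, 11037)
print(res)
```

The interval for the relative risk is asymmetric around 3.60 because it is symmetric on the log scale before exponentiating.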

3.1.5  Deriving  Standard  Errors  with  the  Delta  Method 

A simple and useful method exists for deriving standard errors. Let $T_n$ denote a statistic
that is asymptotically normally distributed about a parameter $\theta$, the subscript n expressing
its dependence on sample size. Suppose that an estimator is a function $g(T_n)$ of $T_n$. Then,
under mild conditions, $g(T_n)$ itself has a large-sample normal distribution. The standard
error depends on the rate of change of $g(t)$ at $t = \theta$.

Specifically, for large n, suppose that $T_n$ is normally distributed about $\theta$ with standard
error $\sigma/\sqrt{n}$. That is, as $n \to \infty$, the cdf of $\sqrt{n}(T_n - \theta)$ converges to the cdf of a normal
random variable with mean 0 and variance $\sigma^2$. This limiting behavior is an example of
convergence in distribution, denoted by

$$\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2).$$

Let g be a function that is at least twice differentiable at $\theta$. From the Taylor series expansion
for $g(t)$ in a neighborhood of $t = \theta$,

$$\sqrt{n}[g(T_n) - g(\theta)] \approx \sqrt{n}(T_n - \theta)\,g'(\theta)$$

for large n, where $g'(\theta) = dg/dt$ evaluated at $t = \theta$. Recall if a variate $Y \sim N(0, \sigma^2)$, then
$cY \sim N(0, c^2\sigma^2)$. Thus,

$$\sqrt{n}[g(T_n) - g(\theta)] \xrightarrow{d} N(0, [g'(\theta)]^2\sigma^2). \qquad (3.6)$$

In other words, $g(T_n)$ is approximately normal around $g(\theta)$ with variance $[g'(\theta)]^2\sigma^2/n$.
Section 16.1.2 gives details.

Figure 3.1 portrays this result. Locally around $\theta$, $g(t)$ is approximately linear, with slope
$g'(\theta)$. Then $g(T_n)$ is approximately normal, since linear transformations of normal random
variables are themselves normal. The dispersion of $g(T_n)$ values about $g(\theta)$ is about $|g'(\theta)|$
times the dispersion of $T_n$ values about $\theta$. For example, if the slope of g at $\theta$ is $\tfrac{1}{2}$, then g
maps a region of $T_n$ values into a region of $g(T_n)$ values only about half as wide.

Figure 3.1 Depiction of delta method.

Result (3.6) is called the delta method. Since $g'(\theta)$ and $\sigma^2 = \sigma^2(\theta)$ usually depend on
the unknown parameter $\theta$, the asymptotic variance is unknown. Wald confidence intervals
substitute $T_n$ for $\theta$ and use the result that $\sqrt{n}[g(T_n) - g(\theta)]/[|g'(T_n)|\,\sigma(T_n)]$ is asymptotically
standard normal. Thus,

$$g(T_n) \pm 1.96\,|g'(T_n)|\,\sigma(T_n)/\sqrt{n}$$

is a large-sample Wald 95% confidence interval for $g(\theta)$.


3.1.6  Delta  Method  Applied  to  the  Sample  Logit 

We illustrate the delta method for a function of the ML estimator $T_n = \hat{\pi} = y/n$ of the
binomial parameter $\pi$, for y successes in n trials. Recall that $E(\hat{\pi}) = \pi$ and $\mathrm{var}(\hat{\pi}) =
\pi(1 - \pi)/n$. Also, $\hat{\pi}$ has a large-sample normal distribution by the central limit theorem.
So do many functions of $\hat{\pi}$.

The log odds function of $\hat{\pi}$,

$$g(\hat{\pi}) = \log[\hat{\pi}/(1 - \hat{\pi})],$$

is called the sample logit. Evaluated at $\pi$, its derivative equals $1/[\pi(1 - \pi)]$. By the delta
method, the asymptotic variance of the sample logit is $\pi(1 - \pi)/n$ (which is the variance
of $\hat{\pi}$) multiplied by the square of $1/[\pi(1 - \pi)]$. That is,

$$\sqrt{n}\left( \log\frac{\hat{\pi}}{1 - \hat{\pi}} - \log\frac{\pi}{1 - \pi} \right) \xrightarrow{d} N\left( 0, \frac{1}{\pi(1 - \pi)} \right).$$

The asymptotic normality of $\hat{\pi}$ propagates to asymptotic normality of $\log[\hat{\pi}/(1 - \hat{\pi})]$.

The asymptotic variance is the variance of the normal distribution that approximates
the true distribution, for large n. It is not an approximation for the variance of the true
distribution. For $0 < \pi < 1$, the asymptotic variance $[n\pi(1 - \pi)]^{-1}$ of the sample logit
is finite. By contrast, the true variance does not exist: Since $\hat{\pi} = 0$ or 1 with positive
probability, the logit can equal $-\infty$ or $\infty$ with positive probability. The probability of an
infinite logit converges to zero rapidly as n increases. For large n, the distribution of the
sample logit looks essentially normal with mean $\log[\pi/(1 - \pi)]$ and standard deviation
$[n\pi(1 - \pi)]^{-1/2}$. Thus, for the logit, the asymptotic variance actually has greater use than
the true variance. Incidentally, related to this, the ordinary bootstrap is not helpful for
approximating standard errors for many discrete measures, because it mimics the true
rather than the more relevant asymptotic standard error.
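
The logit calculation can be mirrored numerically. The sketch below is illustrative only: `delta_method_se` is a hypothetical helper that approximates $g'$ by a central finite difference (a convenience assumed here, not something the text prescribes), and the values $\pi = 0.3$ and $n = 200$ are made up for the example. It also evaluates the probability of an infinite sample logit, which for a binomial sample is $\pi^n + (1 - \pi)^n$.

```python
from math import log, sqrt

def delta_method_se(g, theta, sigma, n, h=1e-6):
    """Delta-method standard error |g'(theta)| * sigma / sqrt(n).

    g'(theta) is approximated by a central finite difference with step h.
    """
    g_prime = (g(theta + h) - g(theta - h)) / (2 * h)
    return abs(g_prime) * sigma / sqrt(n)

def logit(p):
    return log(p / (1 - p))

pi, n = 0.3, 200                         # hypothetical values for illustration
sigma = sqrt(pi * (1 - pi))              # sqrt(n) times the SE of the sample proportion
se_numeric = delta_method_se(logit, pi, sigma, n)
se_closed = 1 / sqrt(n * pi * (1 - pi))  # closed form [n pi (1 - pi)]^(-1/2)

# Probability that the sample proportion is exactly 0 or 1 (infinite logit).
prob_infinite = pi ** n + (1 - pi) ** n

print(se_numeric, se_closed, prob_infinite)
```

The two standard errors agree to many decimal places, and `prob_infinite` is negligible at this sample size even though it makes the true variance undefined.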


3.1.7  Delta  Method  for  the  Log  Odds  Ratio 

Standard errors for the log odds ratio and the log relative risk result from a multiparameter
version of the delta method. Suppose that $\{n_i, i = 1, \ldots, c\}$ have a multinomial $(n, \{\pi_i\})$
distribution. The sample proportion $\hat{\pi}_i = n_i/n$ has mean and variance

$$E(\hat{\pi}_i) = \pi_i \quad \text{and} \quad \mathrm{var}(\hat{\pi}_i) = \pi_i(1 - \pi_i)/n. \qquad (3.7)$$

In Section 16.1.4 we show that for $i \neq j$, $\hat{\pi}_i$ and $\hat{\pi}_j$ have covariance

$$\mathrm{cov}(\hat{\pi}_i, \hat{\pi}_j) = -\pi_i\pi_j/n. \qquad (3.8)$$

The sample proportions $(\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_{c-1})$ have a large-sample multivariate normal distribution.
For functions of them, the delta method implies the following result, proved in
Section 16.1.4:

Let $g(\pi)$ denote a differentiable function of $\{\pi_i\}$, with sample value $g(\hat{\pi})$ for a multinomial
sample. Let

$$\phi_i = \frac{\partial g(\pi)}{\partial \pi_i}, \quad i = 1, \ldots, c.$$

Then as $n \to \infty$, the distribution of $\sqrt{n}[g(\hat{\pi}) - g(\pi)]/\sigma$ converges to standard normal,
where

$$\sigma^2 = \sum_i \pi_i\phi_i^2 - \Big( \sum_i \pi_i\phi_i \Big)^2. \qquad (3.9)$$

The asymptotic variance depends on $\{\pi_i\}$ and the partial derivatives of the measure with
respect to $\{\pi_i\}$. In practice, replacing $\{\pi_i\}$ and $\{\phi_i\}$ in (3.9) by their sample values yields an
ML estimate $\hat{\sigma}^2$ of $\sigma^2$. Then $\hat{\sigma}/\sqrt{n}$ is an estimated standard error for $g(\hat{\pi})$. A large-sample
Wald confidence interval for $g(\pi)$ is

$$g(\hat{\pi}) \pm z_{\alpha/2}\,\hat{\sigma}/\sqrt{n}.$$

With the substitution of $\hat{\sigma}$ for $\sigma$ in (3.9), the limiting distribution is still standard normal,
but convergence is slower. The equivalence in the large-sample distribution is justified
as follows: The sample proportions converge in probability to $\{\pi_i\}$, by the weak law of
large numbers. Since $\hat{\sigma}$ is a continuous function of the sample proportions, it converges in
probability to $\sigma$, and $\sigma/\hat{\sigma}$ converges in probability to 1. Now

$$\frac{\sqrt{n}[g(\hat{\pi}) - g(\pi)]}{\hat{\sigma}} = \frac{\sqrt{n}[g(\hat{\pi}) - g(\pi)]}{\sigma} \cdot \frac{\sigma}{\hat{\sigma}}.$$

The first term on the right-hand side converges in distribution to standard normal, by (3.9),
and the second term converges in probability to 1. Thus, their product also has a limiting
standard normal distribution.

We now apply the delta method to the log odds ratio, taking $g(\pi) = \log \theta = \log \pi_{11} +
\log \pi_{22} - \log \pi_{12} - \log \pi_{21}$. Since

$$\phi_{11} = \partial(\log \theta)/\partial\pi_{11} = 1/\pi_{11},$$

$$\phi_{12} = -1/\pi_{12}, \quad \phi_{21} = -1/\pi_{21}, \quad \phi_{22} = 1/\pi_{22},$$

$\sum_i \sum_j \pi_{ij}\phi_{ij} = 0$ and $\sigma^2 = \sum_i \sum_j \pi_{ij}\phi_{ij}^2 = \sum_i \sum_j (1/\pi_{ij})$. The standard error of $\log \hat{\theta}$ for
a multinomial sample $\{n_{ij}\}$ is

$$\sigma(\log \hat{\theta}) = \sigma/\sqrt{n} = \Big[ \sum_i \sum_j 1/(n\pi_{ij}) \Big]^{1/2}.$$

Since $n\hat{\pi}_{ij} = n_{ij}$, the estimated standard error is (3.1).
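
The algebra above can be checked numerically. The following Python sketch evaluates the multiparameter delta-method variance with sample proportions in place of $\{\pi_{ij}\}$ and compares it with the closed form $(\sum_i \sum_j 1/n_{ij})^{1/2}$. The function name `multinomial_delta_se` is hypothetical, and the counts are those of Table 3.1, treated here as a single multinomial sample purely for illustration.

```python
from math import sqrt

def multinomial_delta_se(pi, phi, n):
    """Delta-method SE of g(pi_hat) for a multinomial sample.

    Uses sigma^2 = sum(pi_i * phi_i^2) - (sum(pi_i * phi_i))^2 and returns
    sigma / sqrt(n).
    """
    s1 = sum(p * f * f for p, f in zip(pi, phi))
    s2 = sum(p * f for p, f in zip(pi, phi))
    return sqrt((s1 - s2 * s2) / n)

counts = [54, 10325, 25, 51790]           # n11, n12, n21, n22
n = sum(counts)
pi_hat = [c / n for c in counts]

# Partial derivatives of log(theta) = log pi11 + log pi22 - log pi12 - log pi21.
signs = [1, -1, -1, 1]
phi_hat = [s / p for s, p in zip(signs, pi_hat)]

se_delta = multinomial_delta_se(pi_hat, phi_hat, n)
se_direct = sqrt(sum(1 / c for c in counts))   # closed-form estimated SE
print(se_delta, se_direct)
```

Both routes give the same standard error because the weighted sum of the partial derivatives vanishes for the log odds ratio.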

3.1.8  Simultaneous  Confidence  Intervals  for  Multiple  Comparisons 

Often,  such  as  in  many  genetics  applications,  there  are  several  groups  to  compare  in 
terms  of  some  parameter.  Multiple  comparison  methods  apply  the  confidence  level  to  the 
simultaneous  set  of  all  comparisons,  rather  than  to  each  individual  one. 

A  simple  multipurpose  although  somewhat  conservative  way  to  establish  control  over 
a family of inferences is the Bonferroni method. For it, with g inferences we use an error
probability of $\alpha^* = \alpha/g$ for each one. For instance, to form g confidence intervals with
simultaneous coverage probability of at least $1 - \alpha$, we use a standard method but with
confidence level $1 - \alpha/g$ for each. This implies an upper bound of $\alpha$ for the probability
of at least one error for the entire set of intervals. Exercise 1.36 applied the method
to simultaneous comparison of all pairs of multinomial parameters. Goodman (1964a)
presented simultaneous confidence intervals for all odds ratios in an I × J table. Note
3.2 cites an alternative method for comparing multiple binomial parameters. Section 7.5.2
further  describes  the  Bonferroni  method,  and  Section  7.5.3  presents  a  less  conservative 
approach  to  multiple  comparisons  in  the  context  of  significance  testing. 
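
To make the Bonferroni adjustment concrete, the short Python sketch below computes the per-interval error probability and the matching two-sided normal critical value. The function name `bonferroni_z` and the choice of $g = 10$ intervals are hypothetical.

```python
from statistics import NormalDist

def bonferroni_z(alpha, g):
    """Per-inference error rate alpha/g and its two-sided normal critical value."""
    alpha_star = alpha / g
    z_star = NormalDist().inv_cdf(1 - alpha_star / 2)
    return alpha_star, z_star

# Ten simultaneous intervals with overall error probability at most 0.05.
alpha_star, z_star = bonferroni_z(0.05, 10)
z_single = NormalDist().inv_cdf(0.975)     # critical value for one 95% interval
print(alpha_star, round(z_star, 3), round(z_single, 3))
```

Each of the ten Wald intervals would then use `z_star` in place of the single-interval critical value, which widens every interval.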


3.2  TESTING  INDEPENDENCE  IN  TWO-WAY  CONTINGENCY  TABLES 

At first we assume multinomial sampling with joint probabilities $\{\pi_{ij}\}$ in an I × J contingency
table. The null hypothesis of statistical independence is $H_0$: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i
and j.


3.2.1  Pearson  and  Likelihood-Ratio  Chi-Squared  Tests 

In Section 1.5.2 we introduced the Pearson $X^2$ statistic (1.16) for tests about specified values
of multinomial probabilities. A test of $H_0$: independence uses $X^2$ with $n_{ij}$ in place of $n_i$ and
with $\mu_{ij} = n\pi_{i+}\pi_{+j}$ in place of $\mu_i$. Here $\mu_{ij} = E(n_{ij})$ under $H_0$. Usually, $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$
are unknown. Their ML estimates are the sample marginal proportions $\hat{\pi}_{i+} = n_{i+}/n$ and
$\hat{\pi}_{+j} = n_{+j}/n$. So, the estimated expected frequencies are $\{\hat{\mu}_{ij} = n\hat{\pi}_{i+}\hat{\pi}_{+j} = n_{i+}n_{+j}/n\}$.
Then, the Pearson statistic is

$$X^2 = \sum_i \sum_j \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}. \qquad (3.10)$$

Pearson (1900, 1904, 1922) claimed that replacing $\{\mu_{ij}\}$ by estimates $\{\hat{\mu}_{ij}\}$ would not
affect the large-sample distribution of $X^2$. Since the contingency table has IJ categories,
he argued that $X^2$ is asymptotically chi-squared with $df = IJ - 1$. On the contrary, since
$\{\hat{\mu}_{ij}\}$ require estimating $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, by Section 1.5.6,

$$df = (IJ - 1) - (I - 1) - (J - 1) = (I - 1)(J - 1).$$

The dimensions of $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ reflect the constraints $\sum_i \pi_{i+} = \sum_j \pi_{+j} = 1$. R. A.
Fisher (1922) corrected Pearson’s error. Fisher’s article introduced the notion of degrees of
freedom. (Pearson had introduced an indexed family of chi-squared distributions but had
not dealt explicitly with “degrees of freedom.”)

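The score test produces the $X^2$ statistic. The likelihood-ratio test produces a different
statistic. For multinomial sampling, the kernel of the likelihood is

$$\prod_i \prod_j \pi_{ij}^{n_{ij}}, \quad \text{where all } \pi_{ij} \geq 0 \text{ and } \sum_i \sum_j \pi_{ij} = 1.$$

Under $H_0$: independence, $\hat{\pi}_{ij} = \hat{\pi}_{i+}\hat{\pi}_{+j} = n_{i+}n_{+j}/n^2$. In the general case, $\hat{\pi}_{ij} = n_{ij}/n$.
The ratio of the likelihoods equals

$$\Lambda = \frac{\prod_i \prod_j (n_{i+}n_{+j})^{n_{ij}}}{n^n \prod_i \prod_j n_{ij}^{n_{ij}}}.$$

The likelihood-ratio chi-squared statistic is $-2\log\Lambda$. Denoted by $G^2$, it equals

$$G^2 = -2\log\Lambda = 2\sum_i \sum_j n_{ij}\log(n_{ij}/\hat{\mu}_{ij}). \qquad (3.11)$$
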

The  larger  the  values  of  G2  and  X2,  the  more  evidence  exists  against  independence.  For 
either  statistic,  the  P-value  is  the  right-tail  probability  above  the  observed  value. 

In the general case, the parameter space consists of $\{\pi_{ij}\}$ subject to the linear restriction
$\sum_i \sum_j \pi_{ij} = 1$, so the dimension is $IJ - 1$. Under $H_0$, $\{\pi_{ij}\}$ are determined by $\{\pi_{i+}\}$
and $\{\pi_{+j}\}$, so the dimension is $(I - 1) + (J - 1)$. The difference in these dimensions
equals $(I - 1)(J - 1)$. For large samples, $G^2$ has a chi-squared null distribution with
$df = (I - 1)(J - 1)$. So $G^2$ and $X^2$ have the same limiting null chi-squared distribution.
In fact, they are then asymptotically equivalent; $X^2 - G^2$ converges in probability to zero
(Section 16.3.4).

When there are independent multinomial samples in the I rows, the row marginal counts
are fixed. Independence then corresponds to homogeneity of each outcome probability
among the rows. Roy and Mitra (1956) showed that the limiting chi-squared results for a
single multinomial sample also hold then (and for comparable statistics in three-way tables),
as well as when we condition further on the column marginal totals. As we’ll discuss in
Section 3.5, conditional on row and column marginal totals, a hypergeometric distribution
applies to the cell counts. In this case, $\{\hat{\mu}_{ij}\}$ in tests of independence are exact (rather than
estimated) expected values. For 2 × 2 tables, for example,

$$E(n_{11}) = \frac{n_{1+}n_{+1}}{n} \quad \text{and} \quad \mathrm{var}(n_{11}) = \frac{n_{1+}n_{+1}n_{2+}n_{+2}}{n^2(n - 1)}.$$


For I × J tables, Haldane (1940) derived $E(X^2) = (I - 1)(J - 1)n/(n - 1)$. See Note 3.3
for other moments.

Table 3.2 Attained Education (Highest Degree) and Belief in God

                                                  Belief in God
Highest            Don’t       No Way to   Some Higher   Believe      Believe       Know God
Degree             Believe     Find Out    Power         Sometimes    but Doubts    Exists      Total

Less than          9           8           27            8            47            236         335
high school        (10.0)^a    (15.9)      (34.2)        (12.7)       (55.3)        (206.9)
                   (-0.4)^b    (-2.2)      (-1.4)        (-1.5)       (-1.3)        (3.6)

High school or     23          39          88            49           179           706         1084
junior college     (32.5)      (51.5)      (110.6)       (41.2)       (178.9)       (669.4)
                   (-2.5)      (-2.6)      (-3.3)        (1.8)        (0.0)         (3.4)

Bachelor or        28          48          89            19           104           293         581
graduate           (17.4)      (27.6)      (59.3)        (22.1)       (95.9)        (358.8)
                   (3.1)       (4.7)       (4.8)         (-0.8)       (1.1)         (-6.7)

Total              60          95          204           76           330           1235        2000

^a Estimated expected frequencies for testing independence.
^b Standardized residuals.

Source: 2008 General Social Survey, National Opinion Research Center.


3.2.2  Example:  Education  and  Belief  in  God 

Table 3.2 uses General Social Survey data to cross-classify opinion about whether God
exists by highest education degree attained. The table also contains the estimated expected
frequencies for $H_0$: independence. For instance, $\hat{\mu}_{11} = n_{1+}n_{+1}/n = (335 \times 60)/2000 =
10.0$. The chi-squared statistics are $X^2 = 76.1$ and $G^2 = 73.2$, with $df = (3 - 1)(6 - 1) =
10$. The P-values are < 0.0001. These statistics provide extremely strong evidence of an
association.
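
The statistics in this example can be reproduced with a short Python sketch. The function names are hypothetical. The tail-probability helper relies on the closed form for the chi-squared survival function when df is even, $e^{-x/2}\sum_{k=0}^{df/2 - 1}(x/2)^k/k!$, a shortcut assumed here so that the example needs only the standard library; it does not apply when df is odd.

```python
from math import exp, factorial, log

def independence_tests(table):
    """Pearson X^2, likelihood-ratio G^2, and df for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu = row_tot[i] * col_tot[j] / n      # estimated expected frequency
            x2 += (n_ij - mu) ** 2 / mu
            if n_ij > 0:                          # a zero count contributes 0 to G^2
                g2 += 2 * n_ij * log(n_ij / mu)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

def chi2_sf_even_df(x, df):
    """Right-tail chi-squared probability, valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

# Counts from Table 3.2 (rows: highest degree; columns: belief in God).
belief = [
    [9, 8, 27, 8, 47, 236],
    [23, 39, 88, 49, 179, 706],
    [28, 48, 89, 19, 104, 293],
]
x2, g2, df = independence_tests(belief)
p_x2, p_g2 = chi2_sf_even_df(x2, df), chi2_sf_even_df(g2, df)
print(round(x2, 1), round(g2, 1), df, p_x2, p_g2)
```

For odd df, a general chi-squared tail routine from a statistics library would be needed instead of the closed form.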


3.2.3  Adequacy  of  Chi-Squared  Approximations 

The convergence of the actual sampling distribution of $X^2$ or $G^2$ to the chi-squared distribution
applies as n grows, and hence $\{\mu_{ij} = n\pi_{ij}\}$ grow, for a fixed number of cells. As the cell
means grow, the multinomial distribution for $\{n_{ij}\}$ is better approximated by a multivariate
normal, and $X^2$ and $G^2$ have more nearly chi-squared distributions. The adequacy of the
approximation depends on both n and the number of cells. The size of $n/IJ$ that produces
adequate approximations for $X^2$ tends to decrease as IJ increases (Koehler and Larntz
1980).

Contingency tables having small cell counts are said to be sparse. In analyzing the
chi-squared approximation for $X^2$ in sparse tables, Cochran (1954) suggested that when
$df > 1$, a minimum expected value $\mu_{ij} \approx 1$ is permissible as long as no more than about
20% of $\mu_{ij} < 5$. Research has shown that $X^2$ performs adequately with smaller n and more
sparse tables than $G^2$ (see Note 3.3). The distribution of $G^2$ is usually poorly approximated
by chi-squared when $n/IJ < 5$. Depending on the sparseness, P-values based on referring
$G^2$ to a chi-squared distribution can be too large or too small. When most $\mu_{ij}$ are smaller
than 0.50, treating $G^2$ as chi-squared gives a highly conservative test; when $H_0$ is true,
reported P-values tend to be much larger than true ones. When most $\mu_{ij}$ are between 0.5
and 4, by contrast, the reported P-value tends to be too small.

A caveat is that chi-squared approximations tend to be poor for tables containing both
very small and moderately large $\mu_{ij}$ (Haberman 1988). It is difficult to give a guideline
that covers all cases. Small-sample methods to be presented in Section 3.5 are available
whenever it is doubtful whether n is sufficiently large.

3.2.4  Chi-Squared  and  Comparing  Proportions  in  2  x  2  Tables 

Often, a 2 × 2 table summarizes results for two independent binomial variates $y_1$ and $y_2$
with $n_1$ and $n_2$ trials. Independence is equivalent to the homogeneity condition, $\pi_1 = \pi_2$.
Under $H_0$: $\pi_1 = \pi_2$, the estimated common value of $\pi_1 = \pi_2$ is $\hat{\pi} = (y_1 + y_2)/(n_1 + n_2)$.
The z score test statistic

$$z = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1 - \hat{\pi})\left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} \qquad (3.12)$$

has denominator that is the standard error of $\hat{\pi}_1 - \hat{\pi}_2$ estimated under $H_0$. This statistic has
an asymptotic standard normal null distribution.

This statistic relates to the Pearson statistic for testing independence in the 2 × 2 table
by $z^2 = X^2$. Recall that if a statistic z has an approximate standard normal distribution,
then $z^2$ has an approximate chi-squared distribution with $df = 1$, which is $(I - 1)(J - 1)$
applied with $I = J = 2$.

A simple formula for $X^2$ for 2 × 2 tables is

$$X^2 = \frac{n(n_{11}n_{22} - n_{12}n_{21})^2}{n_{1+}n_{2+}n_{+1}n_{+2}}.$$

For example, for the 2 × 2 table having entries (3, 0/0, 3), by row, used for an example in
Section 3.5.6 on small-sample inference,

$$X^2 = [6(3 \times 3 - 0 \times 0)^2]/(3 \times 3 \times 3 \times 3) = 6.0.$$

Section 5.3.5 shows a generalized formula for comparing I proportions in I × 2 tables.
Mirkin (2001) showed alternative $X^2$ formulas for I × J tables.
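
The identity $z^2 = X^2$ and the shortcut formula can be checked numerically with the sketch below. The function names are hypothetical, and the pooled statistic is assumed to take its usual form: the difference of sample proportions divided by $[\hat{\pi}(1 - \hat{\pi})(1/n_1 + 1/n_2)]^{1/2}$, with $\hat{\pi}$ the pooled proportion. The second check reuses the counts of the aspirin example in Section 3.1.4.

```python
from math import sqrt

def pooled_z(y1, n1, y2, n2):
    """z statistic for H0: pi1 = pi2, with the standard error estimated under H0."""
    p1, p2 = y1 / n1, y2 / n2
    p = (y1 + y2) / (n1 + n2)                  # pooled estimate of the common pi
    return (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def pearson_2x2(n11, n12, n21, n22):
    """Pearson X^2 for a 2x2 table via the shortcut formula."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den

x2_small = pearson_2x2(3, 0, 0, 3)             # the (3, 0 / 0, 3) table

# Aspirin example: 18 of 11,034 (placebo) and 5 of 11,037 (aspirin).
z = pooled_z(18, 11034, 5, 11037)
x2_aspirin = pearson_2x2(18, 11034 - 18, 5, 11037 - 5)
print(x2_small, z ** 2, x2_aspirin)
```

The shortcut needs only the four counts, so it is convenient when the expected frequencies are not otherwise required.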

3.2.5  Score  Confidence  Intervals  Comparing  Proportions 

The Wald confidence intervals for the difference of proportions, odds ratio, and relative
risk presented in Section 3.1 are simple but have disadvantages: They are dependent on the
scale of measurement [e.g., a Wald interval is not the same for $\theta$ as when found for $\log(\theta)$
and then exponentiated], they fail when an estimate falls at the boundary of the parameter
space [e.g., a cell count of 0 causing $\log(\hat\theta) = \pm\infty$ and $\hat\sigma(\log\hat\theta) = \infty$], and they can have
actual probability of covering the parameter quite far from the nominal level unless n is
quite large. Alternative intervals that result from inverting score tests or likelihood-ratio
tests do not have these disadvantages. These tests use extensions of the $X^2$ or $G^2$ statistics
that apply to nonnull values of the parameters. Although these intervals are computationally
more complex than the Wald intervals, this should not be an impediment to their use in this
modern era of computing, as the principle behind them is straightforward.

We illustrate the score method for forming an interval for the difference of proportions.
Consider testing $H_0$: $\pi_1 - \pi_2 = \Delta_0$, where $\Delta_0$ need not be 0. Let $\hat\pi_1(\Delta_0)$ and $\hat\pi_2(\Delta_0)$ denote
the ML estimates of $\pi_1$ and $\pi_2$ subject to the constraint $\pi_1 - \pi_2 = \Delta_0$. That is, $\hat\pi_1(\Delta_0)$ and
$\hat\pi_2(\Delta_0)$ are the values of $\pi_1$ and $\pi_2$ satisfying $\pi_1 - \pi_2 = \Delta_0$ that maximize the product of
the two binomial probability mass functions. The score test statistic is


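$$z(\Delta_0) = \frac{(\hat\pi_1 - \hat\pi_2) - \Delta_0}{\sqrt{\dfrac{\hat\pi_1(\Delta_0)[1 - \hat\pi_1(\Delta_0)]}{n_1} + \dfrac{\hat\pi_2(\Delta_0)[1 - \hat\pi_2(\Delta_0)]}{n_2}}}.$$

The score confidence interval is the set of $\Delta_0$ such that $|z(\Delta_0)| < z_{\alpha/2}$ (Mee 1984). For
given $\Delta_0$, each $\hat\pi_i(\Delta_0)$ and hence $z(\Delta_0)$ can be found explicitly, but finding the endpoints
of the interval requires iteration (Nurminen 1986).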

For $\Delta_0 = 0$, the test statistic $z(\Delta_0)$ simplifies to the pooled statistic (3.12) for comparing
two proportions. Then, $[z(\Delta_0)]^2$ is the Pearson $X^2$ statistic. For $\Delta_0 \ne 0$ this square is a
nonnull type of Pearson statistic. Unlike the Wald interval, the score interval is coherent
with the result of the Pearson chi-squared test of independence; for instance, the P-value for
that test falling below 0.05 is equivalent to the 95% score confidence interval for $\pi_1 - \pi_2$
not containing 0. For Table 2.1 on aspirin use and heart attacks, the 95% score interval for
$\pi_1 - \pi_2$ is (0.0004, 0.0022).
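
The inversion of the score test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of Mee (1984) or Nurminen (1986): the restricted ML estimates are found by a ternary search (the text notes they can also be found explicitly), the endpoints by bisection, and the data (30 of 100 versus 15 of 100 successes) and all function names are hypothetical.

```python
import math

Z_975 = 1.959964  # standard normal 0.975 quantile

def restricted_mle(y1, n1, y2, n2, delta0, iters=200):
    """Maximize the product-binomial log likelihood subject to pi1 - pi2 = delta0.

    Ternary search over pi2; the log likelihood is concave in pi2.
    """
    eps = 1e-12

    def loglik(p2):
        p1 = min(max(p2 + delta0, eps), 1 - eps)
        p2 = min(max(p2, eps), 1 - eps)
        return (y1 * math.log(p1) + (n1 - y1) * math.log(1 - p1)
                + y2 * math.log(p2) + (n2 - y2) * math.log(1 - p2))

    lo, hi = max(0.0, -delta0), min(1.0, 1.0 - delta0)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    p2 = (lo + hi) / 2
    return p2 + delta0, p2

def score_z(y1, n1, y2, n2, delta0):
    """Score statistic z(delta0) with the standard error estimated under H0."""
    p1, p2 = restricted_mle(y1, n1, y2, n2, delta0)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (y1 / n1 - y2 / n2 - delta0) / se

def bisect(f, a, b, iters=100):
    """Root of f on [a, b], assuming f(a) and f(b) have opposite signs."""
    fa = f(a)
    for _ in range(iters):
        mid = (a + b) / 2
        fm = f(mid)
        if (fa > 0) == (fm > 0):
            a, fa = mid, fm
        else:
            b = mid
    return (a + b) / 2

def score_interval(y1, n1, y2, n2, zcrit=Z_975):
    """Set of delta0 with |z(delta0)| < zcrit; z decreases as delta0 increases."""
    est = y1 / n1 - y2 / n2
    lower = bisect(lambda d: score_z(y1, n1, y2, n2, d) - zcrit, -0.999, est)
    upper = bisect(lambda d: score_z(y1, n1, y2, n2, d) + zcrit, est, 0.999)
    return lower, upper

lower, upper = score_interval(30, 100, 15, 100)  # hypothetical counts
print(round(lower, 4), round(upper, 4))
```

For these hypothetical counts the Pearson statistic is about 6.45, which exceeds the 0.05 critical value of 3.84, so the 95% score interval should exclude 0, illustrating the coherence described above.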

Score-test-based confidence intervals have also been proposed for the odds ratio (Cornfield
1956) and for the relative risk (Koopman 1984). We illustrate for the odds ratio for a
multinomial sample over the cells of the 2 × 2 table. Recall that the joint distribution $\{\pi_{ij}\}$
can equivalently be expressed in terms of $(\theta, \pi_{1+}, \pi_{+1})$ (Section 2.4.1). For a given nonnull
odds ratio value $\theta_0$, let $\{\hat\mu_{ij}(\theta_0)\}$ be the unique expected frequency estimates that have the
same row and column margins as $\{n_{ij}\}$ and satisfy

$$\frac{\hat\mu_{11}(\theta_0)\,\hat\mu_{22}(\theta_0)}{\hat\mu_{12}(\theta_0)\,\hat\mu_{21}(\theta_0)} = \theta_0.$$

The set of $\theta_0$ satisfying

$$X^2(\theta_0) = \sum_i \sum_j \frac{[n_{ij} - \hat\mu_{ij}(\theta_0)]^2}{\hat\mu_{ij}(\theta_0)} < \chi^2_1(\alpha)$$

forms a 100(1 − α)% score-test-based confidence interval. This interval is also coherent
with the result of the Pearson chi-squared test, for $H_0$: $\theta = 1$. The 95% score interval for
the odds ratio for Table 3.1 on seat-belt use and traffic accidents is (6.76, 17.35).


3.2.6  Profile  Likelihood  Confidence  Intervals 

Likewise, we can construct confidence intervals by inverting likelihood-ratio tests for
nonnull parameter values. We illustrate with the odds ratio. For $\{\hat\mu_{ij}(\theta_0)\}$ as just defined,
the set of $\theta_0$ satisfying

$$G^2(\theta_0) = 2 \sum_i \sum_j n_{ij} \log[n_{ij}/\hat\mu_{ij}(\theta_0)] < \chi^2_1(\alpha)$$

forms a 100(1 − α)% likelihood-ratio test-based confidence interval. The 95% interval for
the odds ratio for Table 3.1 on seat-belt use and traffic accidents is (6.82, 17.70).
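
Both odds-ratio intervals can be sketched with the same machinery. With the margins fixed, the condition on the fitted odds ratio reduces to a quadratic equation in the first fitted count, after which $X^2(\theta_0)$ and $G^2(\theta_0)$ are inverted by bisection on the log scale. The code below is a minimal illustration under stated assumptions: the 2 × 2 counts are hypothetical (Table 3.1 is not reproduced here), all observed counts are assumed positive, and the function names are our own.

```python
import math

CHI2_1_95 = 3.841459  # chi-squared 0.95 quantile, df = 1

def fitted(table, theta0):
    """Fitted counts having the observed margins and odds ratio theta0."""
    (n11, n12), (n21, n22) = table
    r1, c1, n = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
    # mu11 = m solves (1 - theta0) m^2 + [n + (theta0 - 1)(r1 + c1)] m - theta0 r1 c1 = 0
    a = 1.0 - theta0
    b = n + (theta0 - 1.0) * (r1 + c1)
    c = -theta0 * r1 * c1
    disc = math.sqrt(b * b - 4 * a * c)
    m = -2.0 * c / (b + disc) if b >= 0 else (-b + disc) / (2.0 * a)
    return [[m, r1 - m], [c1 - m, n - r1 - c1 + m]]

def x2(table, theta0):
    """Nonnull Pearson statistic X^2(theta0)."""
    mu = fitted(table, theta0)
    return sum((table[i][j] - mu[i][j]) ** 2 / mu[i][j]
               for i in range(2) for j in range(2))

def g2(table, theta0):
    """Nonnull likelihood-ratio statistic G^2(theta0)."""
    mu = fitted(table, theta0)
    return 2 * sum(table[i][j] * math.log(table[i][j] / mu[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)

def interval(stat, table, crit=CHI2_1_95):
    """Set of theta0 with stat(theta0) < crit, found by bisection on log(theta0)."""
    (n11, n12), (n21, n22) = table
    log_hat = math.log(n11 * n22 / (n12 * n21))

    def solve(a, b):
        fa = stat(table, math.exp(a)) - crit
        for _ in range(200):
            mid = (a + b) / 2
            fm = stat(table, math.exp(mid)) - crit
            if (fa > 0) == (fm > 0):
                a, fa = mid, fm
            else:
                b = mid
        return math.exp((a + b) / 2)

    return solve(log_hat - 10, log_hat), solve(log_hat, log_hat + 10)

table = [[20, 10], [8, 22]]  # hypothetical counts, sample odds ratio 5.5
score_ci = interval(x2, table)
profile_ci = interval(g2, table)
print(score_ci, profile_ci)
```

At $\theta_0 = 1$ the fitted counts are the usual independence estimates, so `x2(table, 1.0)` is the ordinary Pearson statistic and `g2(table, 1.0)` the ordinary $G^2$.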

More generally, in later chapters we’ll often construct a confidence interval for a model
parameter $\beta$, regarding the other parameters in the model as nuisance parameters. Denote
those nuisance parameters, such as the marginal probabilities in a 2 × 2 table when we are
estimating an odds ratio, by $\psi$. In inverting a likelihood-ratio test of $H_0$: $\beta = \beta_0$ to check
whether $\beta_0$ belongs in the confidence interval, the ML estimate $\hat\psi(\beta_0)$ that maximizes
the likelihood under the null varies as $\beta_0$ does. The profile log-likelihood function is
$L(\beta_0, \hat\psi(\beta_0))$, viewed as a function of $\beta_0$. For each $\beta_0$ this function gives the maximum of
the ordinary log likelihood subject to the constraint $\beta = \beta_0$. Evaluated at $\beta_0 = \hat\beta$, this is
the maximized log likelihood $L(\hat\beta, \hat\psi)$, which occurs at the unrestricted ML estimates. The
profile likelihood confidence interval for $\beta$ is the set of $\beta_0$ for which

$$-2[L(\beta_0, \hat\psi(\beta_0)) - L(\hat\beta, \hat\psi)] < \chi^2_1(\alpha).$$

The interval contains all $\beta_0$ not rejected in likelihood-ratio tests of nominal size $\alpha$.

Score  intervals  currently  are  available  only  in  specialized  software,  such  as  R  functions 
given in this book’s computing appendix.² The profile likelihood approach is more generally
available,  for  example  the  confint()  function  in  R,  the  LRCI  option  in  PROC  GENMOD 
and  the  PLCL  option  in  PROC  LOGISTIC  in  SAS,  and  the  pllf  command  in  Stata. 


3.3  FOLLOWING-UP  CHI-SQUARED  TESTS 

Like  any  significance  test,  chi-squared  tests  of  independence  have  limited  usefulness.  A 
small  P-value  indicates  strong  evidence  of  association  but  provides  little  information  about 
the  nature  or  strength  of  the  association.  Statisticians  have  long  warned  about  dangers 
of  relying  solely  on  results  of  chi-squared  tests  rather  than  studying  the  nature  of  the 
association  (e.g.,  Berkson  1938,  Cochran  1954).  In  this  section  we  discuss  ways  to  follow 
up  the  tests  to  learn  more  about  the  association. 


3.3.1  Pearson  Residuals  and  Standardized  Residuals 

A cell-by-cell comparison of observed and estimated expected frequencies helps show the
nature of the dependence. Under $H_0$, larger differences $(n_{ij} - \hat\mu_{ij})$ tend to occur in cells
with larger $\hat\mu_{ij}$. Thus, this raw difference is insufficient. The Pearson residual, defined for
a cell by

$$e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}}} \qquad (3.13)$$

attempts to adjust for this. The name “Pearson” results from $\{e_{ij}\}$ relating to the Pearson
statistic by $X^2 = \sum_i \sum_j e_{ij}^2$.

Under $H_0$, $\{e_{ij}\}$ are asymptotically normal with mean 0. However, their asymptotic
variances are less than 1.0, averaging $[(I - 1)(J - 1)]/IJ$. A standardized residual that
is asymptotically standard normal results from dividing $(n_{ij} - \hat\mu_{ij})$ by its standard error
(Haberman 1973a, Sec. 16.3.2). For $H_0$: independence, this is

$$r_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1 - p_{i+})(1 - p_{+j})}}. \qquad (3.14)$$

² See www.stat.ufl.edu/~aa/cda/cda.html.


In 2 × 2 tables, df = 1 and $r_{11} = -r_{12} = -r_{21} = r_{22}$, and every $r_{ij}^2 = X^2$. By contrast, all four
Pearson residuals can take different values, which is unappealing.

A standardized residual that exceeds about 2 or 3 in absolute value indicates lack of fit
of $H_0$ in that cell. Larger values are more relevant when df is larger, as it becomes more
likely that at least one such residual is large simply by chance.
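
Formulas (3.13) and (3.14) translate directly into code. The Python sketch below computes both kinds of residuals for an arbitrary two-way table and checks the 2 × 2 properties just stated; the counts and the function name are hypothetical, chosen only for illustration.

```python
import math

def residuals(table):
    """Pearson (3.13) and standardized (3.14) residuals for the independence fit."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    pearson, standardized = [], []
    for i, counts in enumerate(table):
        e_row, r_row = [], []
        for j, nij in enumerate(counts):
            mu = row[i] * col[j] / n              # estimated expected frequency
            e = (nij - mu) / math.sqrt(mu)        # Pearson residual
            e_row.append(e)
            r_row.append(e / math.sqrt((1 - row[i] / n) * (1 - col[j] / n)))
        pearson.append(e_row)
        standardized.append(r_row)
    return pearson, standardized

table = [[20, 10], [8, 22]]  # hypothetical 2 x 2 counts
e, r = residuals(table)
x2 = sum(v ** 2 for e_row in e for v in e_row)
print(x2, [[round(v, 3) for v in r_row] for r_row in r])
```

For this table the four standardized residuals share one absolute value whose square equals $X^2$, whereas the four Pearson residuals differ in magnitude.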


3.3.2  Example:  Education  and  Belief  in  God  Revisited 

Table 3.2 also shows standardized residuals for testing independence. For instance, $n_{36}$ =
293 and $\hat\mu_{36}$ = 358.8. The relevant marginal proportions equal $p_{3+}$ = 581/2000 = 0.2905
and $p_{+6}$ = 1235/2000 = 0.6175. The standardized residual (3.14) for this cell equals

$$r_{36} = (293 - 358.8)/\sqrt{358.8(1 - 0.2905)(1 - 0.6175)} = -6.7.$$

We  can  infer  that,  in  the  population  in  2008,  fewer  people  at  the  highest  level  of  education 
would  have  responded  “know  God  exists”  than  if  the  variables  were  truly  independent. 

For the “know God exists” category, Table 3.2 shows large positive residuals for subjects
with a junior college education or less. We can infer that more subjects at these education
levels had this opinion than if $H_0$: independence were true. Other large positive residuals
occur in the first three categories of belief in God for those with at least a bachelor’s degree,
suggesting those cells are also more common than we’d expect under independence.

Figure  3.2  is  a  mosaic  plot  for  Table  3.2.  Mosaic  plots  portray  the  counts  by  tiles 
(rectangles)  whose  size  is  proportional  to  the  cell  count.  Under  independence,  the  vertical 
lines  would  match  up  at  the  same  spot  in  each  row.  Color  and  depth  of  shading  of  the  tiles 
can  represent  the  sign  and  magnitude  of  standardized  residuals  (Friendly  1994).  The  scale 
on  the  right  of  the  figure  shows  the  magnitude  of  the  standardized  residuals. 


3.3.3  Partitioning  Chi-Squared 

Another  supplement  to  a  chi-squared  test  uses  the  reproductive  property  of  chi-squared 
(Section  1.2.6)  to  partition  the  test  statistic  so  that  the  components  represent  certain  aspects 
of  the  effects.  A  partitioning  may  show  that  an  association  reflects  primarily  differences 
between  certain  categories  or  groupings  of  categories. 

We begin with a partitioning for the test of independence in 2 × J tables. We partition
$G^2$, which has df = (J − 1), into J − 1 components. The jth component is $G^2$ for a 2 × 2
table where the first column combines columns 1 through j of the full table and the second
column is column j + 1. That is, $G^2$ for testing independence in a 2 × J table equals a
statistic that compares the first two columns, plus a statistic that combines the first two
columns and compares them to the third column, and so on, up to a statistic that combines
the first J − 1 columns and compares them to the last column.³ Each component statistic
has df = 1.

Figure 3.2 Mosaic plot for data in Table 3.2. Figure 3.2, when produced with a mosaic() function in R, has
blue tiles (labeled b here) for positive residuals and red (labeled r here) for negative, with dark color when the
standardized value exceeds 4.

It might seem more natural to compute $G^2$ for the (J − 1) separate 2 × 2 tables that pair
each column with a particular one, say, the last. Such an analysis can be informative, but
these component statistics are not independent and do not sum to $G^2$ for the full table. This
is beyond our scope at this stage but relates to the contrasts of log probabilities that form
the log odds ratios for the two tables not being orthogonal.

For an I × J table, independent chi-squared components result from comparing columns
1 and 2 and then combining them and comparing them to column 3, and so on. Each of the
J − 1 statistics has df = I − 1. More refined partitions contain (I − 1)(J − 1) statistics,
each having df = 1. One such partition (Lancaster 1949a) applies to the (I − 1)(J − 1)
separate 2 × 2 tables

$$\begin{array}{c|c}
\sum_{a<i} \sum_{b<j} n_{ab} & \sum_{a<i} n_{aj} \\ \hline
\sum_{b<j} n_{ib} & n_{ij}
\end{array} \qquad (3.15)$$

for i = 2, ..., I and j = 2, ..., J. For others, see Gilula and Haberman (2005) and Goodman
(1969, 1971b).

³ In Section 10.2.4 we explain why this partitioning works.

Table 3.3 Most Influential School of Psychiatric Thought and Ascribed
Origin of Schizophrenia

School of Psychiatric            Origin of Schizophrenia
Thought                  Biogenic   Environmental   Combination
Eclectic                    90            12             78
Medical                     13             1              6
Psychoanalytic              19            13             50

Source: Reprinted with permission, based on data from B. J. Gallagher III, B. J. Jones,
and L. P. Barakat, J. Clin. Psychol. 43: 438–443, 1987.

3.3.4  Example:  Origin  of  Schizophrenia 

Table 3.3 classifies a sample of psychiatrists by their school of psychiatric thought and by
their opinion on the origin of schizophrenia. Here $G^2$ = 23.04 with df = 4. To understand
this association better, we partition $G^2$ into four independent components.

The partitioning (3.15) applies to the subtables shown in Table 3.4. The first subtable
compares the eclectic and medical schools of psychiatric thought on whether the origin of
schizophrenia is biogenic or environmental given that the classification was in one of these
two categories. For this subtable, $G^2$ = 0.29, with df = 1. The second subtable compares
these two schools on the proportion of times the origin was ascribed to be a combination,
rather than biogenic or environmental. This subtable has $G^2$ = 1.36, with df = 1. The sum
of these two components equals $G^2$ for testing independence with the first two rows of
Table 3.3. There is little evidence of a difference between the eclectic and medical schools
of thought on the ascribed origin of schizophrenia.

Next, we combine the eclectic and medical schools and compare them to the psychoanalytic
school. The third subtable in Table 3.4 compares them for the (biogenic, environmental)
classification, giving $G^2$ = 12.95 with df = 1. The fourth subtable compares them for the
(biogenic or environmental, combination) split, giving $G^2$ = 8.43 with df = 1.

The psychoanalytic school seems more likely than the other schools to ascribe the
origins of schizophrenia as being a combination. Of those who chose either the biogenic or
environmental origin, members of the psychoanalytic school were somewhat more likely
than the other schools to choose the environmental origin. The sum of these four $G^2$
components equals the value of 23.04 for testing independence in the full 3 × 3 table.
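
The partition can be reproduced numerically from the counts in Table 3.3. The Python sketch below builds the four 2 × 2 subtables of (3.15) and computes $G^2$ for each and for the full table; the function names are our own, and the code is meant as an illustration rather than a general-purpose routine.

```python
import math

def g2(table):
    """Likelihood-ratio statistic for independence in a two-way table."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    return 2 * sum(nij * math.log(nij * n / (row[i] * col[j]))
                   for i, counts in enumerate(table)
                   for j, nij in enumerate(counts) if nij > 0)

def lancaster_subtables(table):
    """The (I-1)(J-1) 2 x 2 subtables of partition (3.15), using 0-based indices."""
    subs = []
    for i in range(1, len(table)):
        for j in range(1, len(table[0])):
            a = sum(table[x][y] for x in range(i) for y in range(j))
            b = sum(table[x][j] for x in range(i))
            c = sum(table[i][y] for y in range(j))
            subs.append([[a, b], [c, table[i][j]]])
    return subs

# Table 3.3: rows eclectic, medical, psychoanalytic;
# columns biogenic, environmental, combination.
table_3_3 = [[90, 12, 78], [13, 1, 6], [19, 13, 50]]

subtables = lancaster_subtables(table_3_3)
components = [g2(s) for s in subtables]
full = g2(table_3_3)
print([round(c, 2) for c in components], round(sum(components), 2), round(full, 2))
```

The components agree with the values quoted above, and their sum matches $G^2$ for the full table up to floating-point rounding.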


Table 3.4 Subtables Used in Partitioning Chi-Squared for Table 3.3ᵃ

             Bio   Env                 Bio + Env   Com
Ecl           90    12    Ecl             102       78
Med           13     1    Med              14        6

             Bio   Env                 Bio + Env   Com
Ecl + Med    103    13    Ecl + Med       116       84
Psy           19    13    Psy              32       50

ᵃ Bio, biogenic; Com, combination; Ecl, eclectic; Env, environmental; Psy, psychoanalytic.

3.3.5 Rules for Partitioning

Goodman (1968, 1969, 1971b) and Lancaster (1949a, 1969) gave rules for determining
independent components of chi-squared. For forming subtables, among the necessary
conditions are the following:

1.  The  df  for  the  subtables  must  sum  to  the  df  for  the  full  table. 

2.  Each  cell  count  in  the  full  table  must  be  a  cell  count  in  one  and  only  one  subtable. 

3.  Each  marginal  total  of  the  full  table  must  be  a  marginal  total  for  one  and  only  one 
subtable. 

For a certain partitioning, when the subtable df values sum properly but the $G^2$ values do
not, the components are not independent.

For the $G^2$ statistic, exact partitionings occur. The Pearson $X^2$ need not equal the sum of
the $X^2$ values for the subtables. It is valid to use the $X^2$ statistics for the separate subtables;
they simply do not provide an exact algebraic partitioning of $X^2$ for the full table. When
the null hypotheses all hold, $X^2$ does have an asymptotic equivalence with $G^2$, however. In
addition, when the table has small counts and we rely on large-sample distributions, it is
safer to use $X^2$ than $G^2$ to analyze the subtables.

3.3.6  Summarizing  the  Association 

Residual  analyses  and  partitioning  of  chi-squared  are  both  inferential  methods.  They  provide 
information  about  whether  there  is  an  association  and  its  nature,  but  in  an  inferential  manner. 
For  example,  as  n  increases  and  there  truly  is  an  association,  standardized  residuals  tend  to 
be  larger  in  magnitude,  but  they  do  not  describe  the  strength  of  association. 

To describe the strength of association, we can use measures introduced in the previous
chapter, such as the odds ratio, by applying them to either subtables or collapsings of
the table. We illustrate with Table 3.2 on education and belief in God. The 2 × 2 table
constructed by combining the first two rows and combining the first five columns has a
sample odds ratio of (477 × 293)/(942 × 288) = 0.52. For those with at least a bachelor’s
degree, the estimated odds of responding “know God exists” were 0.52 times the estimated
odds for those with less than a bachelor’s degree. Likewise, we can use measures such
as differences and ratios of proportions. For example, the sample proportion responding
“know God exists” was 0.704 for those with less than a high school education and 0.504
for those with a bachelor’s degree or higher, for a difference of 0.20 and a ratio of 1.40. We
can also construct confidence intervals for such parameters, as discussed in Sections 3.1,
3.2.5, and 3.2.6.

A useful summary of the degree to which cells depart from independence compares
cell counts with the independence fit by the estimates $\{\hat\alpha_{ij} = n_{ij}/\hat\mu_{ij} = p_{ij}/(p_{i+}p_{+j})\}$ of the
association factors (Section 2.4.2). For those with the highest degree who responded “know
God exists,” this is $\hat\alpha_{36}$ = 293/[(581)(1235)/2000] = 0.82; that is, the observed count was
82% of what independence predicts.
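
The descriptive summaries above involve only simple arithmetic, which the following Python lines reproduce from the counts and proportions quoted in the text. The variable names and the row and column labeling of the collapsed table are our own.

```python
# Collapsed 2 x 2 version of Table 3.2, using the counts quoted in the text:
# rows = (less than bachelor's, bachelor's or higher),
# columns = (other responses, "know God exists").
n11, n12 = 477, 942
n21, n22 = 288, 293

odds_ratio = (n11 * n22) / (n12 * n21)

# Sample proportions responding "know God exists", as quoted in the text.
p_low, p_high = 0.704, 0.504
difference = p_low - p_high
ratio = p_low / p_high

# Estimated association factor for the cell with count 293,
# whose row total is 581 and column total is 1235 (n = 2000).
assoc = 293 / (581 * 1235 / 2000)

print(round(odds_ratio, 2), round(difference, 2), round(ratio, 2), round(assoc, 2))
```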

3.3.7  Limitations  of  Chi-Squared  Tests 

Chi-squared tests of independence merely indicate the degree of evidence of association.
They are rarely adequate for answering all questions about a data set. Rather than relying
solely on results of these tests, investigate the nature of the association: Look at the
standardized residuals, decompose chi-squared into components, and estimate parameters
that describe the strength of association.

The chi-squared tests also have limitations in the types of data to which they apply. For
instance, they require large samples. Also, $\{\hat\mu_{ij} = n_{i+}n_{+j}/n\}$ used in $X^2$ and $G^2$ depend
on the marginal totals but not on the order of listing the rows and columns. Thus, $X^2$ and
$G^2$ do not change value with arbitrary reorderings of rows or of columns. This implies
that they treat both classifications as nominal. When at least one variable is ordinal, test
statistics that utilize the ordinality are usually more appropriate. We present such tests in
Section 3.4.


3.3.8  Why  Consider  Independence  If  It’s  Unlikely  to  Be  True? 

Any idealized structure such as independence is unlikely to hold in many situations. With
large samples such as in Table 3.2, it is not surprising to obtain a small P-value. Given this
and the limitations just mentioned, why even bother to consider independence as a possible
representation for a joint distribution?

One reason refers to the benefits of parsimony, using fewer parameters to describe the
data. The estimates $\{\hat\pi_{ij} = n_{i+}n_{+j}/n^2\}$ of the cell probabilities are based on estimating the
(I − 1) + (J − 1) marginal probability parameters $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$. By contrast, the sample
proportions $\{p_{ij} = n_{ij}/n\}$ are based on estimating the IJ − 1 cell probability parameters
$\{\pi_{ij}\}$. When the independence hypothesis approximates the true probabilities well, unless
n is very large the independence-based ML estimates tend to be better than the sample
proportions. The independence estimates smooth the sample counts, somewhat damping
the random sampling fluctuations. This is the same reason that we use models to smooth
data in the rest of the text.

The mean squared error (MSE) formula

$$\mathrm{MSE} = \mathrm{variance} + (\mathrm{bias})^2$$

explains why the independence estimators can have smaller MSE. Although they may be
biased, they have smaller variance because they are based on estimating fewer parameters.
Hence, MSE can be smaller unless n is so large that the bias term dominates the
variance.

We illustrate using Table 3.5, which has $\pi_{ij} = \pi_{i+}\pi_{+j}[1 + \delta(i - 2)(j - 2)]$ for $\pi_{i+}$ =
$\pi_{+j}$ = 1/3. Here $-1 \le \delta \le 1$, with $\delta = 0$ equivalent to independence. When $\delta$ is close to
zero, independence approximates the relationship well. The total MSE values of the two
estimators are


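$$\mathrm{MSE}(\{p_{ij}\}) = \sum_i \sum_j E(p_{ij} - \pi_{ij})^2 = \sum_i \sum_j \mathrm{var}(p_{ij})
= \sum_i \sum_j \frac{\pi_{ij}(1 - \pi_{ij})}{n} = \frac{1}{n}\Bigl(1 - \sum_i \sum_j \pi_{ij}^2\Bigr),$$

$$\mathrm{MSE}(\{\hat\pi_{ij}\}) = \sum_i \sum_j E(\hat\pi_{ij} - \pi_{ij})^2.$$

Table 3.5 Cell Probabilities for MSE Comparison of Estimators

(1 + δ)/9     1/9     (1 − δ)/9
   1/9        1/9        1/9
(1 − δ)/9     1/9     (1 + δ)/9

For Table 3.5,

$$\mathrm{MSE}(\{p_{ij}\}) = \frac{1}{n}\left(\frac{8}{9} - \frac{4\delta^2}{81}\right),$$

and rather tedious calculations yield the corresponding expression for $\mathrm{MSE}(\{\hat\pi_{ij}\})$.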


Table 3.6 lists the total MSE values for various $\delta$ and n. When $\delta = 0$, $\mathrm{MSE}(\{p_{ij}\}) = 8/9n$,
whereas $\mathrm{MSE}(\{\hat\pi_{ij}\}) \approx 4/9n$ for large n. The independence estimator is then much better.
When the table is close to independence ($\delta \approx 0$) and n is not large, MSE is only about half
as large for the independence estimator. When $\delta \ne 0$, the inconsistency of $\{\hat\pi_{ij}\}$ is reflected
by $\mathrm{MSE}(\{\hat\pi_{ij}\}) \to 4\delta^2/81$ [whereas $\mathrm{MSE}(\{p_{ij}\}) \to 0$] as $n \to \infty$. When the table is close
to independence, however, the independence estimator has a lower total MSE even for
moderately large n (e.g., for n = 500 when $\delta$ = 0.1).
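
Several entries of Table 3.6 can be checked directly. The Python sketch below builds the probabilities of Table 3.5, evaluates the total MSE of the sample proportions from $(1 - \sum_i \sum_j \pi_{ij}^2)/n$, evaluates the limiting squared bias of the independence estimator, and approximates the finite-n MSE of the independence estimator by simulation. The simulation settings (2000 replications, seed 0) and all function names are our own assumptions; the simulated value is only an approximation to the tabled one.

```python
import random

def cell_probs(delta):
    """Cell probabilities of Table 3.5 for i, j = 1, 2, 3."""
    return [[(1 + delta * (i - 2) * (j - 2)) / 9 for j in (1, 2, 3)]
            for i in (1, 2, 3)]

def mse_sample_proportions(n, delta):
    """Total MSE of {p_ij}: (1 - sum of squared cell probabilities) / n."""
    pi = cell_probs(delta)
    return (1 - sum(p * p for row in pi for p in row)) / n

def limiting_mse_independence(delta):
    """Limit of the total MSE of the independence estimator: its squared bias."""
    pi = cell_probs(delta)
    row = [sum(r) for r in pi]
    col = [sum(c) for c in zip(*pi)]
    return sum((row[i] * col[j] - pi[i][j]) ** 2
               for i in range(3) for j in range(3))

def simulate_mse_independence(n, delta, reps=2000, seed=0):
    """Monte Carlo approximation to the total MSE of {p_i+ p_+j}."""
    rng = random.Random(seed)
    pi = cell_probs(delta)
    cells = [(i, j) for i in range(3) for j in range(3)]
    weights = [pi[i][j] for i, j in cells]
    total = 0.0
    for _ in range(reps):
        counts = [[0] * 3 for _ in range(3)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            counts[i][j] += 1
        row = [sum(counts[i]) / n for i in range(3)]
        col = [sum(counts[i][j] for i in range(3)) / n for j in range(3)]
        total += sum((row[i] * col[j] - pi[i][j]) ** 2
                     for i in range(3) for j in range(3))
    return total / reps

print(round(1e4 * mse_sample_proportions(10, 1.0)))
print(round(1e4 * limiting_mse_independence(1.0)))
print(1e4 * simulate_mse_independence(50, 0.0))
```

The first two printed values correspond to the entries for the sample proportions at n = 10, $\delta$ = 1.0 and for the independence estimator in the $n = \infty$ row at $\delta$ = 1.0; the third should fall near the tabled value for n = 50, $\delta$ = 0.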


3.4  TWO-WAY  TABLES  WITH  ORDERED  CLASSIFICATIONS 

The $X^2$ and $G^2$ chi-squared tests ignore some information when used to test independence
between ordinal classifications. When rows and/or columns are ordered, other tests that
take the ordering into account are usually more powerful.

3.4.1  Linear  Trend  Alternative  to  Independence 

When  the  row  variable  X  and  the  column  variable  Y  are  ordinal,  a  positive  or  negative  trend 
in  the  association  is  common.  One  approach  to  inference,  described  later  in  this  section,  uses 
an  ordinal  measure  of  monotone  trend.  An  alternative  analysis  assigns  scores  to  categories 
and  summarizes  the  linear  trend  component  of  the  association. 


Table 3.6 Comparison of Total MSE (× 10,000) for Sample Proportion $\{p_{ij}\}$ and
Independence $\{\hat\pi_{ij}\}$ Estimators of the Cell Probabilities in Table 3.5

          δ = 0        δ = 0.1      δ = 0.2      δ = 0.6      δ = 1.0
   n     p     π̂      p     π̂      p     π̂      p     π̂      p     π̂
  10    889   489    888   493    887   505    871   634    840   893
  50    178    91    178    95    177   110    174   261    168   565
 100     89    45     89    50     89    65     87   220     84   529
 500     18     9     18    14     18    28     17   186     17   500
   ∞      0     0      0     5      0    20      0   178      0   494

A test statistic that is sensitive to positive or negative linear trends utilizes correlation
information. Let $u_1 < u_2 < \cdots < u_I$ denote scores for the rows, and let $v_1 < v_2 < \cdots < v_J$
denote column scores. The scores have the same ordering as the categories. They assign
distances between categories and actually treat the measurement scale as interval, with
greater distances between categories that are farther apart.

The sum $\sum_i \sum_j u_i v_j p_{ij}$ weights cross-products of scores by their relative frequency,
$p_{ij} = n_{ij}/n$. It relates to the covariation of X and Y. For the scores chosen, the correlation
r between X and Y equals the standardization of this sum to the −1 to +1 scale. (In fact, r
equals this sum when both sets of scores are linearly transformed for the n subjects to have
a mean of 0 and standard deviation of 1.) The larger r is in absolute value, the farther the
data fall from independence in this linear dimension.

A  statistic  for  testing  independence  against  the  two-sided  alternative  of  nonzero  true 
correlation  is 


$$M^2 = (n - 1)r^2. \qquad (3.16)$$

This statistic increases as |r| or n does. For large samples, it is approximately chi-squared
with df = 1 (Mantel 1963, Yates 1948). Large values contradict independence, so as with
$X^2$ and $G^2$, the P-value is the right-tail probability above the value observed. A small
P-value does not imply that the association is linear, but merely that the linear component
of the association is significant. The test treats the variables symmetrically.

3.4.2  Example:  Is  Happiness  Associated  with  Political  Ideology? 

Table 3.7 cross-classifies degree of happiness by political ideology for all subjects aged
over 65 in the 2008 GSS. The Pearson chi-squared statistic for testing independence is
$X^2$ = 7.07 with df = 4 (P-value = 0.13). This statistic shows little evidence of association,
but it ignores the ordering of rows and columns. With scores (1, 2, 3) for each variable,
the correlation is r = 0.135. The linear trend test statistic is $M^2$ = (321 − 1)(0.135)² = 5.85
with df = 1. This shows strong evidence of association (P = 0.016).

The nontrivial evidence of association may be surprising, since $X^2$ has such an unimpressive
value. When a positive or negative trend exists, analyses designed to detect that
trend have greater power and tend to provide smaller P-values than analyses that ignore it.
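
The calculations in this example can be reproduced from the counts in Table 3.7. The Python sketch below computes $X^2$, the correlation r for scores (1, 2, 3), and $M^2$, together with right-tail chi-squared probabilities. The function names are our own, and the two tail-probability functions use closed forms that hold only for df = 1 and df = 4.

```python
import math

# Table 3.7: rows liberal, moderate, conservative;
# columns not too happy, pretty happy, very happy.
table = [[13, 29, 15], [23, 59, 47], [14, 67, 54]]

def pearson_x2(t):
    """Pearson statistic for testing independence."""
    n = sum(map(sum, t))
    row = [sum(r) for r in t]
    col = [sum(c) for c in zip(*t)]
    return sum((t[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(len(t)) for j in range(len(t[0])))

def correlation(t, u, v):
    """Correlation between row scores u and column scores v over the n subjects."""
    n = sum(map(sum, t))
    row = [sum(r) for r in t]
    col = [sum(c) for c in zip(*t)]
    mu_u = sum(u[i] * row[i] for i in range(len(u))) / n
    mu_v = sum(v[j] * col[j] for j in range(len(v))) / n
    suv = sum(t[i][j] * (u[i] - mu_u) * (v[j] - mu_v)
              for i in range(len(u)) for j in range(len(v)))
    suu = sum(row[i] * (u[i] - mu_u) ** 2 for i in range(len(u)))
    svv = sum(col[j] * (v[j] - mu_v) ** 2 for j in range(len(v)))
    return suv / math.sqrt(suu * svv)

def chi2_tail_df1(x):
    """P(chi-squared with df = 1 exceeds x)."""
    return math.erfc(math.sqrt(x / 2))

def chi2_tail_df4(x):
    """P(chi-squared with df = 4 exceeds x)."""
    return math.exp(-x / 2) * (1 + x / 2)

n = sum(map(sum, table))
x2 = pearson_x2(table)
r = correlation(table, (1, 2, 3), (1, 2, 3))
m2 = (n - 1) * r ** 2
print(round(x2, 2), round(chi2_tail_df4(x2), 2))
print(round(r, 3), round(m2, 2), round(chi2_tail_df1(m2), 3))
```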

3.4.3  Monotone  Trend  Alternatives  to  Independence 

Ordinal variables do not have a specified metric. The method of detecting a linear trend
alternative to independence requires assigning scores to X and Y, treating them as interval
variables. Alternatively, we can add more structure and perform inference about a correlation
for an assumed underlying continuous distribution, as the polychoric correlation does with
the normal distribution (Section 2.4.8). In the opposite direction, a strict ordinal analysis
with the weaker alternative of monotonicity uses an ordinal measure of association, such
as gamma (Section 2.4.5). Inference is available with each of these approaches.

Table 3.7 Happiness and Political Ideology

Political                        Happiness
Ideology        Not too Happy   Pretty Happy   Very Happy
Liberal               13             29            15
Moderate              23             59            47
Conservative          14             67            54

Source: 2008 General Social Survey.

For example, with large random samples, sample gamma has approximately a normal
sampling distribution. The standard error follows from the delta method (Goodman and
Kruskal 1963). Gamma is the basis of an ordinal test of independence using test statistic
$z = \hat\gamma/SE$. A confidence interval describes the strength of positive or negative monotone
association. It is also possible to use as test statistic the ratio $(C - D)/SE_0$ for a null
standard error obtained under the condition of independence (Agresti 2010, Sec. 7.3.3).

For Table 3.7 on happiness and political ideology, $\hat\gamma$ = 0.185. The sample has a weak
tendency for happiness to increase as political conservatism increases. Software⁴ reports
a standard error of 0.078 for gamma. There is considerable evidence that the population
value $\gamma > 0$, since z = 0.185/0.078 = 2.37 (P = 0.018 for the two-sided alternative). An
approximate 95% confidence interval for $\gamma$ is 0.185 ± 1.96(0.078), or (0.032, 0.338). The
true association seems to be relatively weak and could be very weak.
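
The gamma estimate can be verified by counting concordant and discordant pairs in Table 3.7. The Python sketch below does so and then repeats the z test and confidence interval using the standard error of 0.078 reported in the text; the standard error is taken as given rather than recomputed, and the function name is our own.

```python
import math

# Table 3.7: rows liberal, moderate, conservative;
# columns not too happy, pretty happy, very happy.
table = [[13, 29, 15], [23, 59, 47], [14, 67, 54]]

def concordant_discordant(t):
    """Numbers of concordant (C) and discordant (D) pairs of observations."""
    C = D = 0
    for i in range(len(t)):
        for j in range(len(t[0])):
            for k in range(i + 1, len(t)):       # rows below row i
                for l in range(len(t[0])):
                    if l > j:                    # higher on both variables
                        C += t[i][j] * t[k][l]
                    elif l < j:                  # higher on one, lower on the other
                        D += t[i][j] * t[k][l]
    return C, D

C, D = concordant_discordant(table)
gamma_hat = (C - D) / (C + D)

se = 0.078                                   # value reported in the text
z = round(gamma_hat, 3) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
ci = (round(gamma_hat, 3) - 1.96 * se, round(gamma_hat, 3) + 1.96 * se)
print(C, D, round(gamma_hat, 3), round(z, 2), round(p_two_sided, 3), ci)
```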


3.4.4  Extra  Power  with  Ordinal  Tests 

For testing independence, $X^2$ and $G^2$ refer to the most general alternative, whereby cell
probabilities exhibit any type of statistical dependence. Their df value of (I − 1)(J − 1)
reflects an alternative hypothesis that has (I − 1)(J − 1) more parameters than the null
hypothesis: the nonredundant odds ratios that describe the association [such as (2.10)].
These statistics are designed to detect any pattern for these parameters. In achieving this
generality, they sacrifice sensitivity for detecting particular patterns.

By contrast, the analyses for ordinal row and column variables describe association
using a single parameter. For instance, $M^2$ uses the correlation. When a chi-squared test
statistic refers to a single parameter [such as $M^2$ or $(\hat\gamma/SE)^2$ does], it has df = 1. When
the association truly has a positive or negative trend, an ordinal test has a power advantage
over the tests using $X^2$ or $G^2$. Since df equals the mean of the chi-squared distribution, a
relatively large $M^2$ value with df = 1 falls farther out in its right-hand tail than a comparable
value of $X^2$ or $G^2$ with df = (I − 1)(J − 1); falling farther out in the tail produces a smaller
P-value. The potential discrepancy in power increases as I and J increase.⁵


3.4.5  Sensitivity  to  Choice  of  Scores 

Often,  it  is  unclear  how  to  assign  scores  to  statistics  that  require  them,  such  as  M 1  in  Section 
3.4.1.  Cochran  (1954)  noted  that  “any  set  of  scores  gives  a  valid  test,  provided  that  they  are 
constructed  without  consulting  the  results  of  the  experiment.  If  the  set  of  scores  is  poor,  in 
that  it  badly  distorts  a  numerical  scale  that  really  does  underlie  the  ordered  classification, 
the  test  will  not  be  sensitive.  The  scores  should  therefore  embody  the  best  insight  available 
about  the  way  in  which  the  classification  was  constructed  and  used.”  Ideally,  the  scale  is 
chosen  by  a  consensus  of  experts,  and  subsequent  interpretations  use  that  same  scale. 

⁴For example, PROC FREQ in SAS.

⁵In Section 5.3.8 we present the theory behind such a power comparison.

How sensitive are analyses to the choice of scores? There is no simple answer.⁶ For most
data sets, different choices of monotone scores give similar results. Scores that are linear
transforms of each other, such as (1, 2, 3, 4) and (0, 2, 4, 6), have the same absolute correlation
and hence the same $M^2$. Results may depend on the scores, however, when the data
are  highly  unbalanced,  with  some  categories  having  many  more  observations  than  others. 


3.4.6  Example:  Infant  Birth  Defects  by  Maternal  Alcohol  Consumption 

Graubard  and  Korn  (1987)  used  Table  3.8  to  illustrate  the  potential  dependence.  It  refers  to 
a  prospective  study  of  maternal  drinking  and  birth  defects.  After  the  first  three  months  of 
pregnancy,  the  women  in  the  sample  completed  a  questionnaire  about  alcohol  consumption. 
Following  childbirth,  observations  were  recorded  on  the  presence  or  absence  of  congenital 
sex  organ  malformations.  When  a  variable  is  nominal  but  has  only  two  categories,  statistics 
that  treat  it  as  ordinal  are  still  valid.  For  instance,  we  can  artificially  regard  malformation 
as  ordinal,  treating  “present”  as  “high”  and  “absent”  as  “low.”  With  only  two  rows,  any  set 
of distinct row scores is a linear transformation of any other set and gives the same $M^2$
value. Alcohol consumption, measured as the average number of drinks per day, is an ordinal
explanatory variable. This groups a naturally continuous variable, and we first use the scores
$\{v_1 = 0, v_2 = 0.5, v_3 = 1.5, v_4 = 4.0, v_5 = 7.0\}$, the last score being somewhat arbitrary.
For this choice, $M^2 = 6.57$, for which the P-value is 0.010. By contrast, for the equally
spaced scores (1, 2, 3, 4, 5), $M^2 = 1.83$, giving a much weaker conclusion (P = 0.18).

An  alternative  approach  uses  the  data  to  form  the  scores  automatically,  with  ranks  as 
the  category  scores.  All  subjects  in  a  category  receive  the  average  of  the  ranks  that  would 
apply  for  a  complete  ranking  of  the  sample  from  1  to  n.  These  are  called  midranks.  When 
X and Y are both ordinal and $M^2$ uses midrank scores, the correlation on which $M^2$ is
based is called Spearman's rho. For Table 3.8, the 17,114 subjects at level 0 for alcohol
consumption share ranks 1 through 17,114. Each receives the average of these ranks,
which is the midrank (1 + 17,114)/2 = 8557.5. Similarly, the midranks for the last four
categories are 24,365.5, 32,013, 32,473, and 32,555.5. These scores yield $M^2 = 0.35$ and
a weaker conclusion yet (P = 0.55).
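The dependence of $M^2$ on the scores can be checked numerically. The following is a minimal sketch, assuming the usual definition $M^2 = (n - 1)r^2$ from Section 3.4.1, with $r$ the sample correlation between row and column scores; the function names (`m_squared`, `midranks`, `chi2_df1_pvalue`) are illustrative and not from the text. The df = 1 chi-squared tail probability is obtained from the standard normal via `math.erfc`.

```python
from math import erfc, sqrt

# Counts from Table 3.8; columns are alcohol levels 0, <1, 1-2, 3-5, >=6.
rows = [
    [17066, 14464, 788, 126, 37],  # malformation absent
    [48, 38, 5, 1, 1],             # malformation present
]
col_totals = [sum(col) for col in zip(*rows)]


def m_squared(table, row_scores, col_scores):
    """Return M^2 = (n - 1) r^2, with r the correlation between the scores."""
    cells = [(u, v, count)
             for u, row in zip(row_scores, table)
             for v, count in zip(col_scores, row)]
    n = sum(count for _, _, count in cells)
    mean_u = sum(u * count for u, _, count in cells) / n
    mean_v = sum(v * count for _, v, count in cells) / n
    s_uv = sum((u - mean_u) * (v - mean_v) * count for u, v, count in cells)
    s_uu = sum((u - mean_u) ** 2 * count for u, _, count in cells)
    s_vv = sum((v - mean_v) ** 2 * count for _, v, count in cells)
    return (n - 1) * s_uv ** 2 / (s_uu * s_vv)


def chi2_df1_pvalue(x):
    """Right-tail probability of a chi-squared variate with df = 1."""
    return erfc(sqrt(x / 2))


def midranks(totals):
    """Average rank shared by the subjects in each ordered category."""
    scores, below = [], 0
    for t in totals:
        scores.append(below + (t + 1) / 2)
        below += t
    return scores


score_sets = {
    "unequal spacing": [0, 0.5, 1.5, 4.0, 7.0],
    "equally spaced": [1, 2, 3, 4, 5],
    "midranks": midranks(col_totals),
}
for name, scores in score_sets.items():
    m2 = m_squared(rows, [0, 1], scores)
    print(f"{name:16s} M^2 = {m2:.2f}  P = {chi2_df1_pvalue(m2):.3f}")
```

Because malformation has only two levels, the row scores (0, 1) could be replaced by any other pair of distinct values without changing the result.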

Why does this happen? Adjacent categories having relatively few observations necessarily
have similar midranks. The midranks are similar for the final three categories, since
those categories have few observations compared with the first two categories. This scoring
scheme treats alcohol consumption level 1-2 drinks (category 3) as much closer to
consumption level ≥ 6 drinks (category 5) than to consumption level 0 drinks (category 1).


Table 3.8 Data for Which Test Results Depend Greatly on Scores for Alcohol Consumption

                Alcohol Consumption (average number of drinks per day)
Malformation    0         < 1       1-2    3-5    ≥ 6
Absent          17,066    14,464    788    126    37
Present         48        38        5      1      1

Source: Reprinted with permission from the Biometric Society (Graubard and Korn 1987).


⁶See Note 5.7 for efficiency results when one variable is binary.

This seems inappropriate. It is usually better to select scores that reflect perceived distances
between  categories.  When  uncertain  about  this  choice,  a  sensitivity  analysis  should  be 
performed,  selecting  two  or  three  sensible  choices  and  checking  whether  results  are  similar. 
Equally  spaced  scores  often  provide  a  reasonable  compromise  when  the  category  labels 
do  not  suggest  obvious  choices,  such  as  the  categories  (liberal,  moderate,  conservative)  for 
political  philosophy. 

3.4.7 Trend Tests for I × 2 and 2 × J Tables

When $I$ or $J$ equals 2, the tests based on linear or monotonic trend simplify to well-established
procedures. With binary $X$, 2 × J tables occur in comparisons of two groups,
such as when the rows represent two treatments. Using scores $\{u_1 = 0, u_2 = 1\}$ for levels
of $X$, the covariation measure $\sum_i \sum_j u_i v_j p_{ij}$ in $M^2$ simplifies to $\sum_j v_j p_{2j}$. Divided by the
proportion of subjects in row 2, it gives the mean score for that row. In fact, $M^2$ is then
directed toward detecting differences between the two row means of the scores on $Y$.

With midrank scores for $Y$, the test using $M^2$ for 2 × J tables is sensitive to differences
in mean ranks for the two rows. This test is called the Wilcoxon or Mann-Whitney test.
The large-sample version of that test uses a standard normal z statistic that is equivalent
to $z = (C - D)/SE_0$ based on the difference between the numbers of concordant and
discordant pairs relative to the null SE. The square of the statistic is equivalent to $M^2$,
using arbitrary row scores and midranks for the columns. For summarizing the difference
between the two groups, related measures such as $\Delta = P(Y_1 > Y_2) - P(Y_2 > Y_1)$ are also
relevant  (Section  2.4.6).  Ryu  and  Agresti  (2008)  proposed  score-type  confidence  intervals 
for  such  measures. 

When $Y$ has two levels, the table has size I × 2. The linear trend statistic then refers
to  a  linear  trend  in  the  probability  of  either  response  category,  such  as  the  probability  of 
malformation  as  a  function  of  alcohol  consumption.  The  test  in  that  case,  often  called  the 
Cochran-Armitage  trend  test,  is  presented  in  Section  5.3.5. 

3.4.8  Nominal-Ordinal  Tables 

Inference  using  measures  such  as  the  correlation  and  gamma  is  appropriate  when  both 
classifications  are  ordinal.  When  one  is  nominal  with  more  than  two  categories,  other 
statistics  are  needed.  One  is  based  on  summarizing  the  variation  among  means  on  the 
ordinal  variable  in  the  various  categories  of  the  nominal  variable.  We  defer  discussion  of 
this  case  to  Note  3.7,  Exercise  3.37,  and  Section  8.4.3. 


3.5  SMALL-SAMPLE  INFERENCE  FOR  CONTINGENCY  TABLES 

The  inferential  methods  of  the  preceding  four  sections  are  large-sample  methods.  When  n 
is  small,  alternative  methods  use  exact  small-sample  distributions  rather  than  large-sample 
approximations.  In  this  section  we  describe  small-sample  tests  of  independence,  starting 
with one that R. A. Fisher proposed for 2 × 2 tables.

3.5.1  Fisher’s  Exact  Test  for  2  x  2  Tables 

In Section 16.5.1 we show that, under $H_0$: independence, conditioning on the marginal totals
of the contingency table produces a null distribution for the cell counts that does not depend
on unknown parameters. Usually both margins are not naturally fixed. For Poisson sampling
nothing is fixed, for multinomial sampling only $n$ is fixed, and for independent binomial
row samples only the row marginal totals are fixed. In any of these cases, conditioning on
both sets of marginal totals in a 2 × 2 table yields the hypergeometric distribution


$$p(t) = P(n_{11} = t) = \frac{\binom{n_{1+}}{t}\binom{n_{2+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}. \tag{3.17}$$


This formula expresses the distribution of $\{n_{ij}\}$ in terms of only $n_{11}$. Given the marginal
totals, $n_{11}$ determines the other three cell counts. The range of possible values for $n_{11}$ is
$m_- \le n_{11} \le m_+$, where $m_- = \max(0, n_{1+} + n_{+1} - n)$ and $m_+ = \min(n_{1+}, n_{+1})$.

For 2 × 2 tables, independence is equivalent to the odds ratio $\theta = 1$. To test $H_0$: $\theta = 1$,
the P-value is the sum of certain hypergeometric probabilities. To illustrate, consider
$H_a$: $\theta > 1$. For the given marginal totals, tables having larger $n_{11}$ have larger sample odds
ratios and hence stronger evidence in favor of $H_a$. Thus, the P-value equals $P(n_{11} \ge t_0)$,
where $t_0$ denotes the observed value of $n_{11}$. This test for 2 × 2 tables is called Fisher’s exact
test.
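The one-sided P-value is a finite sum of hypergeometric terms, so it can be computed directly. The sketch below assumes the hypergeometric form (3.17) and the support limits $m_-$ and $m_+$ given above; the function names and the example table are hypothetical, chosen only for illustration.

```python
from math import comb


def hypergeom_pmf(t, n1_plus, n2_plus, n_plus1):
    """P(n11 = t) given the row totals and the first column total."""
    n = n1_plus + n2_plus
    return comb(n1_plus, t) * comb(n2_plus, n_plus1 - t) / comb(n, n_plus1)


def support(n1_plus, n2_plus, n_plus1):
    """Possible values m- .. m+ of n11 for the given margins."""
    n = n1_plus + n2_plus
    return range(max(0, n1_plus + n_plus1 - n), min(n1_plus, n_plus1) + 1)


def fisher_one_sided(n11, n12, n21, n22):
    """P-value P(n11 >= t0) for the alternative theta > 1."""
    n1_plus, n2_plus, n_plus1 = n11 + n12, n21 + n22, n11 + n21
    return sum(hypergeom_pmf(t, n1_plus, n2_plus, n_plus1)
               for t in support(n1_plus, n2_plus, n_plus1) if t >= n11)


# Hypothetical 2 x 2 table (not from the text): rows (7, 3) and (2, 8).
print(round(fisher_one_sided(7, 3, 2, 8), 4))
```

For the alternative $\theta < 1$, the same sum would run over $t \le t_0$ instead.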


3.5.2  Example:  Fisher’s  Tea  Drinker 

R. A. Fisher (1935a) described the following experiment from his days working at Rothamsted
Experimental Station, an agriculture research lab north of London. Dr. Muriel Bristol,
a  colleague  of  Fisher’s,  claimed  that  when  drinking  tea  she  could  distinguish  whether  milk 
or  tea  was  added  to  the  cup  first  (she  preferred  milk  first).  To  test  her  claim,  Fisher  asked 
her  to  taste  eight  cups  of  tea,  four  of  which  had  milk  added  first  and  four  of  which  had  tea 
added  first.  She  knew  there  were  four  cups  of  each  type  and  had  to  predict  which  four  had 
the  milk  added  first.  The  order  of  presenting  the  cups  to  her  was  randomized. 

Table  3.9  shows  a  possible  result.  Distinguishing  the  order  of  pouring  better  than  with 
pure guessing corresponds to $\theta > 1$, reflecting a positive association between order of
pouring and the prediction. We conduct Fisher’s exact test of $H_0$: $\theta = 1$ against $H_a$: $\theta > 1$.

The  experimental  design  fixed  both  marginal  distributions,  since  Dr.  Bristol  had  to 
predict  which  four  cups  had  milk  added  first.  Thus,  the  hypergeometric  applies  naturally 
for the null distribution of $n_{11}$. The P-value for Fisher’s exact test is the null probability
of  Table  3.9  and  of  tables  having  even  more  evidence  in  favor  of  her  claim.  The  observed 


Table 3.9 Data for Fisher’s Tea-Tasting Experiment

                Guess Poured First
Poured First    Milk    Tea    Total
Milk            3       1      4
Tea             1       3      4
Total           4       4

Source: Based on experiment described by Fisher (1935a).




table, $t_0 = 3$ correct choices of the cups having milk added first, has null probability

$$\frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{16}{70} = 0.229.$$

The only table that is more extreme in the direction of $H_a$ has $n_{11} = 4$ correct. It has a
probability of 0.014. The P-value is $P(n_{11} \ge 3) = 0.243$. This result does not establish an
association between the actual order of pouring and her predictions. According to Fisher’s
daughter (Box 1978, p. 134), in reality Dr. Bristol did convince Fisher of her ability.

3.5.3  Two-Sided  P-Values  for  Fisher’s  Exact  Test 

For the one-sided alternative, the same P-value results using tables ordered according to
larger $n_{11}$, larger odds ratio, or larger difference of proportions (Davis 1986a). For the
two-sided alternative, different criteria can have different P-values.

For a two-sided P-value, the most common approach (Irwin 1935) sums $P(n_{11} = t)$ in
(3.17) for counts $t$ such that $p(t) \le p(t_0)$; that is, the P-value is $P[p(n_{11}) \le p(t_0)]$ for the
observed value $t_0$. Another possibility sums $p(t)$ for tables that are farther from $H_0$; that is,

$$P\text{-value} = P[\,|n_{11} - E(n_{11})| \ge |t_0 - E(n_{11})|\,],$$

where the hypergeometric $E(n_{11}) = n_{1+}n_{+1}/n$. This is identical to $P(X^2 \ge X_0^2)$ for observed
Pearson statistic $X_0^2$. A third approach doubles the minimum one-sided P-value, that
is, $2\min[P(n_{11} \ge t_0), P(n_{11} \le t_0)]$, but this can exceed 1. A fourth approach (Blaker 2000)
uses $Q = \min[P(n_{11} \ge t_0), P(n_{11} \le t_0)]$ plus an attainable probability in the other tail that
is as close as possible to, but not greater than, that one-tailed probability. This P-value can
be expressed as $P(Q \le q_0)$ for observed value $q_0$ of $Q$.

Each approach has advantages and disadvantages (3.10). They can provide different
results because of the discreteness and potential skewness. The approach of ordering tables
by a distance measure from $H_0$, such as $X^2$, extends naturally to I × J tables. Exact tests
for that more general case are deferred to Section 16.5.2.
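The four two-sided criteria described above can be compared side by side. The following is a minimal sketch, assuming the hypergeometric null distribution (3.17); the function name, the dictionary labels, and the second example's margins are hypothetical. Exact fractions are used so that ties between table probabilities are handled without floating-point error.

```python
from fractions import Fraction
from math import comb


def two_sided_pvalues(t0, n1_plus, n2_plus, n_plus1):
    """Four two-sided P-values for observed n11 = t0 with the given margins."""
    n = n1_plus + n2_plus
    ts = range(max(0, n1_plus + n_plus1 - n), min(n1_plus, n_plus1) + 1)
    # Exact hypergeometric probabilities p(t) for each possible n11.
    p = {t: Fraction(comb(n1_plus, t) * comb(n2_plus, n_plus1 - t),
                     comb(n, n_plus1)) for t in ts}
    upper = {t: sum(p[s] for s in ts if s >= t) for t in ts}  # P(n11 >= t)
    lower = {t: sum(p[s] for s in ts if s <= t) for t in ts}  # P(n11 <= t)
    mean = Fraction(n1_plus * n_plus1, n)                      # E(n11)
    q = {t: min(upper[t], lower[t]) for t in ts}               # statistic Q

    results = {
        "probability ordering": sum(p[t] for t in ts if p[t] <= p[t0]),
        "distance from E(n11)": sum(p[t] for t in ts
                                    if abs(t - mean) >= abs(t0 - mean)),
        "doubled one-sided": 2 * q[t0],  # may exceed 1
        "Blaker": sum(p[t] for t in ts if q[t] <= q[t0]),
    }
    return {name: float(value) for name, value in results.items()}


# Tea-tasting data (Table 3.9): the null distribution is symmetric,
# so the four criteria agree.
print(two_sided_pvalues(3, 4, 4, 4))
# Hypothetical margins (not from the text) with a skewed null distribution.
print(two_sided_pvalues(5, 6, 14, 8))
```

The Blaker value is computed here through the representation $P(Q \le q_0)$ rather than by searching the opposite tail explicitly.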

In  practice,  two-sided  tests  are  more  common  than  one-sided.  Partly  this  is  so  that 
researchers  can  avoid  charges  of  bias  in  giving  evidence  that  supports  their  predicted 
direction  for  an  effect.  To  conduct  a  test  of  size  0.05  when  you  truly  believe  that  the  effect 
has  a  particular  direction,  you  can  conduct  the  one-sided  test  at  the  0.025  level  to  guard 
against criticism. For instance, in the 1998 document Biostatistical Principles for Clinical
Trials, the International Conference on Harmonization (ICH E9) stated: “The approach of
setting  type  I  errors  for  one-sided  tests  at  half  the  conventional  type  I  error  used  in  two- 
sided  tests  is  preferable  in  regulatory  settings.  This  promotes  consistency  with  two-sided 
confidence  intervals  that  are  generally  appropriate  for  estimating  the  possible  size  of  the 
difference  between  two  treatments.” 


3.5.4  Confidence  Intervals  Based  on  Conditional  Likelihood 

Small-sample  methods  also  apply  to  estimation.  An  approach  discussed  in  Section  7.3  uses 
a  conditional  likelihood  function  to  eliminate  nuisance  parameters  by  conditioning  on  their 
sufficient  statistics.  This  is  useful  even  with  large-sample  confidence  intervals.  In  fact,  the 
intervals  for  the  odds  ratio  in  Sections  3.2.5  and  3.2.6  that  utilize  joint  distributions  with 
fixed column margins as well as row margins are actually conditional score and profile
likelihood  confidence  intervals. 

Small-sample interval estimation entails other complicating issues resulting from discreteness
and evaluating performance over an entire parameter space. We defer discussion
of  those  methods  to  Sections  16.6.4  and  16.6.8. 

3.5.5  Discreteness  and  Conservatism  Issues 

The hypergeometric distribution (3.17) is highly discrete for small samples, because $n_{11}$
and  hence  the  P-value  can  assume  relatively  few  values.  It  is  usually  not  possible  to  achieve 
a  fixed  significance  level  (size)  such  as  0.05. 

In the tea-tasting experiment, for instance, $n_{11}$ can equal only 4, 3, 2, 1, 0. The one-sided
P-values are restricted to 0.014, 0.243, 0.757, 0.986, and 1.0. If we reject $H_0$ when the
P-value ≤ 0.05, then 0.05 is not the probability of type I error. Only the P-value of 0.014
does not exceed 0.05; thus, when $H_0$ is true, the probability of falsely rejecting it is 0.014,
not  0.05.  In  this  sense,  the  traditional  approach  to  hypothesis  testing  is  conservative:  The 
true  probability  of  type  I  error  is  bounded  above  by  the  nominal  level. 

It is possible to achieve any fixed significance level by data-unrelated randomization on
the boundary of the critical region, in deciding whether to reject $H_0$. For the tea-tasting
experiment, suppose that we reject $H_0$ when $n_{11} = 4$, we reject $H_0$ with probability 0.157
when $n_{11} = 3$, and we do not reject $H_0$ otherwise; that is, when $n_{11} = 3$, we generate
a uniform random variable $U$ over [0, 1] and reject $H_0$ if $U < 0.157$. For expectation
taken with respect to the null hypergeometric distribution of $n_{11}$, the significance level then
equals

$$P(\text{reject } H_0) = E[P(\text{reject } H_0 \mid n_{11})] = 1.0(0.014) + 0.157(0.229) + 0.0 \times P(n_{11} \le 2) = 0.05.$$
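The size calculation for the randomized test can be reproduced with exact arithmetic. This is a small sketch for the tea-tasting margins only; the variable names are illustrative. With unrounded hypergeometric probabilities the boundary rejection probability comes out as 5/32 = 0.15625, while the value 0.157 in the text follows from the rounded probabilities 0.014 and 0.229.

```python
from fractions import Fraction
from math import comb

# Null hypergeometric probabilities of n11 for the tea-tasting margins (4, 4, 4).
p = {t: Fraction(comb(4, t) * comb(4, 4 - t), comb(8, 4)) for t in range(5)}

alpha = Fraction(5, 100)
# Reject outright at n11 = 4; spend the remaining size on the boundary point n11 = 3.
boundary_prob = (alpha - p[4]) / p[3]
size = 1 * p[4] + boundary_prob * p[3] + 0 * sum(p[t] for t in range(3))

print(float(p[4]), float(p[3]))  # about 0.014 and 0.229
print(float(boundary_prob))      # 0.15625 with unrounded probabilities
print(float(size))               # 0.05
```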

With  the  randomization  extension,  Tocher  (1950)  showed  that  Fisher’s  test  is  uniformly 
most  powerful  unbiased  (UMPU)  for  the  chosen  size  (here,  0.05).  This  property  follows 
from  conditioning  on  a  sufficient  statistic  that  is  complete  and  has  distribution  in  the 
exponential family (Lehmann and Romano 2005, Sec. 4.4-4.7).

In  practice,  randomization  having  nothing  to  do  with  the  data  is  unacceptable,  and 
sensible  tests  for  this  hypothesis  are  biased  (Exercise  3.43).  We  recommend  simply  reporting 
the P-value. To reduce conservativeness, report the mid P-value (Section 1.4.4). The test is
no  longer  guaranteed  to  have  true  P(type  I  error)  no  greater  than  the  nominal  value,  but  in 
practice  it  is  rarely  much  greater.  For  the  one-sided  test  with  the  tea-tasting  data, 

$$\text{mid } P\text{-value} = \tfrac{1}{2} P(n_{11} = 3) + P(n_{11} > 3) = 0.129.$$

For a two-sided test using a criterion such as $X^2$, we would add half the probability of the
observed $X^2$ value to the probabilities for larger values, for tables with the given margins.
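As a quick check on the one-sided mid P-value, the sketch below counts half the probability of the observed table plus the full probability of more extreme tables, under the hypergeometric null distribution (3.17). The function name is hypothetical.

```python
from math import comb


def mid_p_one_sided(t0, n1_plus, n2_plus, n_plus1):
    """Mid P-value for the alternative theta > 1: P(n11 = t0)/2 + P(n11 > t0)."""
    n = n1_plus + n2_plus

    def pmf(t):
        return comb(n1_plus, t) * comb(n2_plus, n_plus1 - t) / comb(n, n_plus1)

    upper_limit = min(n1_plus, n_plus1)
    return 0.5 * pmf(t0) + sum(pmf(t) for t in range(t0 + 1, upper_limit + 1))


# Tea-tasting data (Table 3.9): (1/2)(16/70) + 1/70 = 9/70.
print(round(mid_p_one_sided(3, 4, 4, 4), 3))  # 0.129
```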

3.5.6  Small-Sample  Unconditional  Tests  of  Independence 

A  common  sampling  assumption  for  analyses  comparing  two  groups  on  a  binary  response 
is that the rows are independent binomial samples. Then, only $\{n_{i+}\}$ are naturally fixed. For
Poisson and multinomial sampling schemes, neither marginal distribution is fixed. For such
cases it may seem artificial to condition on both sets of marginal counts. An alternative
small-sample  test,  designed  for  independent  binomial  samples,  conditions  on  only  the  row 
totals. 

Under binomial sampling with parameter $\pi_i$ in row $i$, consider testing $H_0$: $\pi_1 = \pi_2$
using some test statistic $T$, such as the Pearson $X^2$. For fixed $\{n_{i+}\}$, $T$ can take a discrete
set of values, one of which is the observed value $t_0$. Given $\pi_1 = \pi_2 = \pi$, the P-value is
$P_\pi(T \ge t_0)$, calculated using the product of the two binomial probability mass functions.
This is the sum of the product binomial probabilities for those pairs of binomial samples
that have $T \ge t_0$. Since $\pi$ is unknown, the actual P-value is defined as

$$P = \sup_{0 \le \pi \le 1} P_\pi(T \ge t_0). \tag{3.18}$$


This  is  an  unconditional  small-sample  test  of  independence.  Like  Fisher’s  exact  test,  the 
true size is no greater than the nominal value [e.g., if we reject when P ≤ 0.05, the actual
P(type I error) ≤ 0.05].

We illustrate using test statistic $X^2$ for the 2 × 2 table having entries (3, 0 / 0, 3), by row,
with fixed row totals (3, 3) as binomial sample sizes. The sample $X^2 = 6.0$. This $X^2$ value
for the observed table and for table (0, 3 / 3, 0) is the maximum possible. For a given value
$\pi$ for $\pi_1 = \pi_2$, the probability of the first table is $[\pi^3(1 - \pi)^0][\pi^0(1 - \pi)^3] = \pi^3(1 - \pi)^3$
(3 successes and 0 failures in the first row and 0 successes and 3 failures in the second),
the product of two binomial probabilities. Similarly, the probability of the second table is
$(1 - \pi)^3\pi^3$. Thus, the P-value conditional on $\pi$ is $P_\pi(X^2 \ge 6) = 2\pi^3(1 - \pi)^3$, the sum of the
product binomial probabilities for those two tables. The supremum of this over $0 \le \pi \le 1$
occurs at $\pi = 1/2$, giving overall P-value equal to $2(0.5)^3(0.5)^3 = 0.031$. By contrast, the
two-sided Fisher’s exact test has P-value equal to $2\binom{3}{3}\binom{3}{0}/\binom{6}{3} = 0.100$.
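The unconditional P-value (3.18) can be approximated by brute force for small samples. The sketch below is one simple way to do it, not the algorithm of any particular package: it enumerates all pairs of binomial outcomes, keeps those with $X^2$ at least as large as observed, and maximizes the product binomial probability over an evenly spaced grid of $\pi$ values. The grid search, the grid size, and the convention $X^2 = 0$ when a column total is zero are assumptions made for this illustration.

```python
from math import comb


def pearson_x2(y1, n1, y2, n2):
    """Pearson X^2 for a 2 x 2 table with y1, y2 successes out of n1, n2."""
    n, s = n1 + n2, y1 + y2
    if s == 0 or s == n:  # a column total is zero; treat as no evidence
        return 0.0
    return n * (y1 * (n2 - y2) - y2 * (n1 - y1)) ** 2 / (n1 * n2 * s * (n - s))


def binom_pmf(y, n, pi):
    """Binomial probability of y successes in n trials."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)


def unconditional_pvalue(y1, n1, y2, n2, grid_size=1000):
    """Approximate sup over pi of P_pi(X^2 >= observed) on a grid in [0, 1]."""
    t0 = pearson_x2(y1, n1, y2, n2)
    extreme = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
               if pearson_x2(a, n1, b, n2) >= t0 - 1e-9]
    best = 0.0
    for k in range(grid_size + 1):
        pi = k / grid_size
        tail = sum(binom_pmf(a, n1, pi) * binom_pmf(b, n2, pi)
                   for a, b in extreme)
        best = max(best, tail)
    return best


# Table (3, 0 / 0, 3) from the text: the maximum occurs at pi = 0.5.
print(round(unconditional_pvalue(3, 3, 0, 3), 3))  # 0.031
```

A grid only approximates the supremum; for larger samples a finer grid or a numerical optimizer around the grid maximum would be a natural refinement.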

Barnard (1945, 1947) was the first to propose an unconditional test comparing binomial
parameters, although he later (1949) withdrew it in favor of Fisher’s exact test. His method
forms a critical region by adding points according to certain criteria until the supremum
over $\pi$ of the probability of points in the region is as close to the desired size as possible.
Several authors have since proposed related tests. Suissa and Shuster (1985) used (3.18)
with $T$ as the pooled or unpooled z statistic for comparing two proportions, which give
identical P-values when $n_{1+} = n_{2+}$. Haber (1986) also used the pooled statistic. Boschloo
(1970) suggested using an increased significance level for the conditional test considered
for all the possible response marginal distributions, such that the unconditional size at each
value of $\pi$ under $H_0$ (averaged over the possible marginal distributions) is no greater than
the nominal level. This essentially uses the P-value from Fisher’s exact test as a test statistic
(Mehrotra et al. 2003, Lin and Yang 2009). That is, the P-value for Boschloo’s test is the
supremum over $\pi$ of the product binomial probability that the Fisher’s P-value is less than
or equal to the observed Fisher P-value. The Boschloo test is necessarily at least as powerful
as Fisher’s test since its rejection region contains that for Fisher’s test.
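The description of Boschloo's test as an unconditional test with Fisher's P-value as the test statistic translates into a short computation. The sketch below is an illustration under stated assumptions: it uses the one-sided Fisher P-value for the alternative $\pi_1 > \pi_2$, and it approximates the supremum over $\pi$ with a grid. The function names and the grid size are hypothetical choices, not taken from the cited papers.

```python
from fractions import Fraction
from math import comb


def fisher_one_sided(y1, n1, y2, n2):
    """Exact P(n11 >= y1) given both margins, for the alternative pi_1 > pi_2."""
    s, n = y1 + y2, n1 + n2
    return sum(Fraction(comb(n1, t) * comb(n2, s - t), comb(n, s))
               for t in range(y1, min(n1, s) + 1))


def boschloo_pvalue(y1, n1, y2, n2, grid_size=1000):
    """Grid approximation to sup over pi of P_pi(Fisher P <= observed Fisher P)."""
    p_obs = fisher_one_sided(y1, n1, y2, n2)
    # Outcomes whose Fisher P-value is no larger than the observed one.
    region = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
              if fisher_one_sided(a, n1, b, n2) <= p_obs]
    best = 0.0
    for k in range(grid_size + 1):
        pi = k / grid_size
        prob = sum(comb(n1, a) * pi ** a * (1 - pi) ** (n1 - a)
                   * comb(n2, b) * pi ** b * (1 - pi) ** (n2 - b)
                   for a, b in region)
        best = max(best, prob)
    return best


# Table (3, 0 / 0, 3): compare the one-sided Fisher and Boschloo P-values.
print(float(fisher_one_sided(3, 3, 0, 3)), boschloo_pvalue(3, 3, 0, 3))
```

In this example only the observed table has a Fisher P-value as small as the observed one, so the Boschloo P-value reduces to the maximum of $\pi^3(1 - \pi)^3$.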

3.5.7  Conditional  Versus  Unconditional  Tests 

Since  Barnard  introduced  the  unconditional  test,  many  statisticians  have  debated  the  proper 
way to conduct small-sample analyses of 2 × 2 tables. Fisher criticized the unconditional
approach, arguing that possible samples with quite different numbers of successes
than observed were not relevant. In Fisher’s (1945) view, “... the existence of these less
informative possibilities should not affect our judgment of significance based on the series
actually observed .... The fact that such an unhelpful outcome as these might occur ...
is  surely  no  reason  for  enhancing  our  judgment  of  significance  in  cases  where  it  has  not 
occurred; ...  it  is  only  the  sampling  distribution  of  samples  of  the  same  type  that  can  supply 
a  rational  test  of  significance.” 

An  adaptation  of  the  unconditional  approach  by  Berger  and  Boos  (1994)  addresses  this 
criticism  somewhat.  They  took  the  supremum  for  the  P-value  over  a  confidence  interval  of 
values for the nuisance parameter $\pi$ rather than over all possible values. Their unconditional
P-value is

$$P = \sup_{\pi \in C_\gamma} P_\pi(T \ge t_0) + \gamma,$$

where $C_\gamma$ is a confidence interval for $\pi$ with coverage probability at least $100(1 - \gamma)\%$. Here,
$\gamma$ is taken to be very small (e.g., 0.001), and the test maintains the guaranteed upper bound
on  size. 

Other  arguments  in  favor  of  conditioning  on  both  sets  of  marginal  totals  are  that  the 
conditional approach provides a simple way to eliminate nuisance parameters that generalizes
to many other contingency table problems, and the margins contain little information
about  the  association  (see  Note  3.9).  In  an  informal  sense,  the  margins  can  contain  much 
more  information  about  the  range  of  plausible  values  for  the  difference  of  proportions  than 
the  odds  ratio,  as  illustrated  by  a  2  x  2  comparison  of  two  groups  of  size  50  each  when  the 
total  sample  contains  only  1  success.  Arguments  against  conditioning  partly  concern  the 
increased discreteness that occurs. The few possible values for $n_{11}$ make it difficult to
obtain  a  small  P-value.  In  repeated  use  with  a  nominal  significance  level,  the  actual  type  I 
error  probability  may  be  much  smaller  than  the  nominal  value  and  the  power  may  suffer. 
Finally,  for  inference  about  nonnull  values  (e.g.,  confidence  intervals),  we  will  see  (Section 
16.6)  that  the  conditional  approach  applies  for  the  odds  ratio  but  not  for  other  association 
measures. 

The  conservatism  issue  is  partly  unavoidable.  Statistics  having  discrete  distributions 
are  necessarily  conservative  in  terms  of  achieving  nominal  significance  levels.  Because 
an  unconditional  test  fixes  only  one  margin,  however,  it  has  many  more  tables  in  the 
reference  set  for  its  sampling  distribution.  That  distribution  is  less  discrete,  and  a  richer 
array  of  possible  P-values  occurs  than  with  Fisher’s  exact  test.  An  unconditional  test  tends 
to  be  less  conservative  and  more  powerful  than  Fisher’s  exact  test.  A  disadvantage  is  that 
computations  are  very  intensive  for  more  complex  problems,  such  as  tables  larger  in  size 
than 2 × 2.

If  a  table  truly  has  two  independent  binomial  samples,  the  unconditional  approach 
seems  sensible.  See  Kempthorne  (1979)  for  a  cogent  argument.  The  conditional  approach 
is  useful  for  other  cases,  such  as  with  convenience  samples  in  experiments.  In  a  randomized 
clinical  trial,  a  sample  of  n  available  subjects  is  randomly  allocated  to  two  treatments. 
The  samples  are  not  binomials,  as  they  are  not  random  samples  from  two  populations 
of  interest.  We  could  focus  on  the  sample  alone  and  consider  the  probability  of  a  result 
at  least  as  extreme  as  observed  if  there  truly  is  no  treatment  effect.  For  instance,  out  of 
all possible ways of choosing $n_{1+}$ of the $n$ subjects for treatment 1, for what proportion
of samples would $n_{11}$ be at least as large as observed? Under the null hypothesis of no
treatment effect, the same overall response distribution $(n_{+1}, n_{+2})$ of successes and failures
occurs regardless of the allocation of subjects to treatments. Thus, the column margin is
also naturally fixed. This argument leads to hypergeometric null probabilities and Fisher’s
exact test (Greenland 1991) or its mid-P adaptation. This randomization-based approach
does  not  extend,  however,  to  nonnull  effect  values  and  hence  to  confidence  intervals. 

Sometimes  both  sets  of  marginal  totals  are  naturally  fixed,  such  as  in  Table  3.9.  Then, 
the  high  degree  of  discreteness  is  unavoidable  and  it  is  natural  to  use  Fisher’s  exact  test  and 
its mid-P adaptation.


3.6  BAYESIAN  INFERENCE  FOR  TWO-WAY  CONTINGENCY  TABLES 

Bayesian methods are relatively straightforward for estimating cell probabilities or association
measures for contingency tables. When we distinguish between response and
explanatory variables, we treat the cell counts as realizations of independent multinomial
samples and formulate prior distributions (such as Dirichlet distributions) for the multinomial
parameters. When both variables are response variables, we can treat the cell counts
as  a  single  multinomial  sample  and  formulate  a  prior  distribution  for  the  entire  set  of  cell 
probabilities. 

3.6.1  Prior  Distributions  for  Comparing  Proportions  in  2  x  2  Tables 

We consider first the comparison of parameters for two independent binomial samples
summarized in a 2 × 2 contingency table. Let $Y_i$ denote a binomial bin($n_i$, $\pi_i$) variate,
$i = 1, 2$.

The conjugate Bayesian approach uses independent beta($\alpha_{i1}$, $\alpha_{i2}$) prior densities for $\pi_i$,
$i = 1, 2$. This yields independent posterior beta($y_i + \alpha_{i1}$, $n_i - y_i + \alpha_{i2}$) densities for $\pi_i$,
$i = 1, 2$.

With independent continuous prior densities, the prior and posterior probability of
homogeneity, $\pi_1 = \pi_2$, is zero. To allow $P(\pi_1 = \pi_2 \mid y_1, n_1; y_2, n_2) > 0$, we could use a prior
distribution that has a positive $P(\pi_1 = \pi_2)$. For example, we could use a prior distribution
for which $\pi_1$ and $\pi_2$ have beta($\alpha_1$, $\alpha_2$) distributions, such that $\pi_1 = \pi_2$ with probability $\gamma$
and $\pi_1$ is independent of $\pi_2$ with probability $1 - \gamma$.

Even if we use a model having $P(\pi_1 = \pi_2) = 0$, in practice, it is possible to treat $\pi_1$ and
$\pi_2$ as dependent, a priori. For instance, if we knew that $\pi_1 = 0.02$, then in many applications
conditionally this would induce the subjective belief that $\pi_2$ is also close to 0. Howard
(1998) suggested a prior distribution for correlated $(\pi_1, \pi_2)$. For odds ratio $\theta$, he amended
the independent beta prior distributions by using prior density function proportional to

$$e^{-(1/2\sigma^2)[\log(\theta)]^2}\, \pi_1^{\alpha_{11} - 1}(1 - \pi_1)^{\alpha_{12} - 1}\, \pi_2^{\alpha_{21} - 1}(1 - \pi_2)^{\alpha_{22} - 1}.$$

The correlation decreases as $\sigma$ increases, with independent beta densities resulting in the
limit as $\sigma \to \infty$. For the amended Jeffreys prior distribution ($\alpha_{i1} = \alpha_{i2} = 0.50$), when $\sigma =$
(1, 2, 3) the correlations are (0.84, 0.59, 0.41) between $\pi_1$ and $\pi_2$ (Agresti and Min 2005a).

An alternative prior distribution that can incorporate correlation is a bivariate normal
distribution for [logit($\pi_1$), logit($\pi_2$)]. Taking marginal means of 0 and standard deviations
of about 3 is relatively uninformative. In that case, when corr[logit($\pi_1$), logit($\pi_2$)] = 0.50,
corr($\pi_1$, $\pi_2$) = 0.45. This case gives results similar to those of Howard’s prior with $\sigma = 3$. An
alternative way of inducing dependence is to use a hierarchical prior, but that is beyond our
scope here.
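The correlation that a bivariate normal prior on the logits induces between $\pi_1$ and $\pi_2$ can be estimated by simulation. The following Monte Carlo sketch assumes marginal means of 0 and a common standard deviation for the two logits; the function name, the number of draws, and the seed are arbitrary illustrative choices, and the result carries simulation error.

```python
import random
from math import exp, sqrt


def simulated_corr(rho, sd=3.0, draws=200_000, seed=1):
    """Monte Carlo estimate of corr(pi_1, pi_2) under the normal logit prior."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        logit1 = sd * z1
        logit2 = sd * (rho * z1 + sqrt(1 - rho ** 2) * z2)  # corr(logits) = rho
        xs.append(1 / (1 + exp(-logit1)))                   # inverse logit
        ys.append(1 / (1 + exp(-logit2)))
    mx, my = sum(xs) / draws, sum(ys) / draws
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / draws
    vx = sum((x - mx) ** 2 for x in xs) / draws
    vy = sum((y - my) ** 2 for y in ys) / draws
    return cov / sqrt(vx * vy)


# Logit correlation 0.50 with standard deviations of 3, as in the text.
print(round(simulated_corr(0.5), 2))
```

The estimate is below the logit-scale correlation because the inverse logit is a nonlinear transformation.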

3.6.2 Posterior Probabilities Comparing Proportions

For the conjugate prior structure of independent beta densities, independent
beta($y_i + \alpha_{i1}$, $n_i - y_i + \alpha_{i2}$) posterior densities determine inference about $\pi_i$, $i = 1, 2$.
We can evaluate through simulation or numerical integration relevant posterior probabilities such as
$P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2)$ and $P(\pi_1 > \pi_2 \mid y_1, n_1; y_2, n_2)$ and construct posterior intervals
for summary measures such as $\pi_1 - \pi_2$ and the odds ratio.
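The simulation route mentioned above can be sketched in a few lines: draw from the two independent beta posteriors and summarize the draws. The default hyperparameters below are Jeffreys priors (0.5, 0.5); the function name, the equal-tail 95% interval, the number of draws, and the example data are illustrative assumptions rather than anything prescribed by the text.

```python
import random


def posterior_summary(y1, n1, y2, n2, a1=(0.5, 0.5), a2=(0.5, 0.5),
                      draws=100_000, seed=1):
    """Simulate P(pi_1 <= pi_2 | data) and an equal-tail interval for pi_1 - pi_2."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        # Independent beta(y_i + alpha_i1, n_i - y_i + alpha_i2) posteriors.
        p1 = rng.betavariate(y1 + a1[0], n1 - y1 + a1[1])
        p2 = rng.betavariate(y2 + a2[0], n2 - y2 + a2[1])
        diffs.append(p1 - p2)
    diffs.sort()
    prob_le = sum(d <= 0 for d in diffs) / draws
    lower = diffs[int(0.025 * draws)]
    upper = diffs[int(0.975 * draws) - 1]
    return prob_le, (lower, upper)


# Hypothetical data (not from the text): 12 of 20 successes versus 6 of 20.
prob, interval = posterior_summary(12, 20, 6, 20)
print(round(prob, 3), tuple(round(x, 3) for x in interval))
```

An interval for the odds ratio could be obtained the same way by storing $p_1(1 - p_2)/[p_2(1 - p_1)]$ for each draw instead of the difference.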

In many applications, one group (say, group 1) receives a new treatment and group 2
receives some standard treatment or placebo, and the purpose of the study is to analyze
whether the response tends to be better with the new treatment. Then, in terms of “success”
probabilities, we could regard $\pi_1 \le \pi_2$ as a null condition and $\pi_1 > \pi_2$ as an alternative.
Then, $P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2)$ is a sort of Bayesian P-value. Howard (1998) showed that
with use of Jeffreys priors with $\alpha_{i1} = \alpha_{i2} = 0.5$, $P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2)$ approximately
equals the one-sided P-value for the large-sample z test (3.12) (which is the signed square
root of the Pearson $X^2$ statistic) for testing $H_0$: $\pi_1 = \pi_2$ against $H_a$: $\pi_1 > \pi_2$.

For testing $H_0: \pi_1 \le \pi_2$ against $\pi_1 > \pi_2$, Altham (1969) showed how the posterior
$P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2)$ relates to the one-sided P-value for Fisher’s exact test. With
prior beta hyperparameters $\alpha_{i1} = \alpha_{i2} = \gamma$, $i = 1, 2$, and $0 < \gamma < 1$, Altham showed that
$P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2)$ is smaller than the Fisher P-value by no more than the null
probability of the observed data. They are identical when we use the improper prior densities
with $(\alpha_{11}, \alpha_{12}) = (1, 0)$ and $(\alpha_{21}, \alpha_{22}) = (0, 1)$. These priors favor $H_0$, in effect
penalizing against concluding that $\pi_1 > \pi_2$. For example, for any data having the same sample
proportion of successes $0 < p < 1$ for the two groups, $P(\pi_1 \le \pi_2 \mid y_1, n_1; y_2, n_2) > 0.50$.
So, Fisher’s one-sided exact P-value corresponds to a Bayesian inference with conservative
prior distributions.

3.6.3  Posterior  Intervals  for  Association  Parameters 

We could also use the Bayesian approach to construct posterior intervals for the difference
of proportions, relative risk, and odds ratio. Any particular prior distribution for
$(\pi_1, \pi_2)$ induces corresponding prior distributions for the measures themselves. For
instance, with independent uniform prior distributions for $\pi_1$ and $\pi_2$, $\pi_1 - \pi_2$ has a triangular
density over (-1, +1), and the log odds ratio has the Laplace density (Nurminen and
Mutanen 1987).

Similarly, a posterior distribution for $(\pi_1, \pi_2)$ induces posterior distributions for the
measures. For the case of independent beta posterior distributions for $\pi_1$ and $\pi_2$, we can
easily simulate the posterior distribution of a measure of association by simulating values
from the beta distributions. Thus, it is easy to simulate reasonable approximations for
posterior intervals, for instance, forming the 95% equal-tail interval using values between
the simulated 2.5 percentile and 97.5 percentile of the posterior distribution.
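
The simulation just described can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the author's software: the function name `equal_tail_interval`, its arguments, the small boundary guard `eps`, and the counts in the usage lines are all assumptions made for the example.

```python
import random

def equal_tail_interval(y1, n1, y2, n2, a=1.0, b=1.0,
                        measure="odds_ratio", draws=100_000,
                        level=0.95, seed=1):
    """Simulated equal-tail posterior interval for a measure comparing
    two binomial parameters under independent beta(a, b) priors."""
    rng = random.Random(seed)
    eps = 1e-12  # assumption: keeps simulated values off the exact boundaries 0 and 1
    values = []
    for _ in range(draws):
        # Independent beta posteriors: beta(y + a, n - y + b) for each group.
        p1 = min(max(rng.betavariate(y1 + a, n1 - y1 + b), eps), 1 - eps)
        p2 = min(max(rng.betavariate(y2 + a, n2 - y2 + b), eps), 1 - eps)
        if measure == "difference":
            values.append(p1 - p2)
        elif measure == "relative_risk":
            values.append(p1 / p2)
        elif measure == "odds_ratio":
            values.append((p1 / (1 - p1)) / (p2 / (1 - p2)))
        else:
            raise ValueError("unknown measure: " + measure)
    values.sort()
    tail = (1 - level) / 2
    lower = values[int(tail * draws)]             # simulated 2.5 percentile
    upper = values[int((1 - tail) * draws) - 1]   # simulated 97.5 percentile
    return lower, upper

# Hypothetical counts: 12 successes in 20 trials versus 6 successes in 20 trials.
for name in ("difference", "relative_risk", "odds_ratio"):
    print(name, equal_tail_interval(12, 20, 6, 20, measure=name))
```

Increasing `draws` reduces the Monte Carlo error in the two percentiles; the defaults `a = b = 1` correspond to uniform priors, and `a = b = 0.5` would give Jeffreys priors.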

Finding more precise intervals requires better approximations. Let $F_\omega(t)$ denote the cdf
for the posterior distribution of a generic measure of association $\omega$. In terms of independent
beta posterior densities $f(\pi_i \mid y_i, n_i)$ for $\pi_i$, $i = 1, 2$,

$$F_\omega(t) = \iint_{S_t} f(\pi_1 \mid y_1, n_1)\, f(\pi_2 \mid y_2, n_2)\, d\pi_1\, d\pi_2,$$

where $S_t = \{(\pi_1, \pi_2) : \omega \le t,\ 0 < \pi_1, \pi_2 < 1\}$. The equal-tail 95% posterior interval⁷
$(L, U)$ satisfies $F_\omega(L) = 0.025$ and $F_\omega(U) = 0.975$. Hashemi et al. (1997) and Nurminen
and Mutanen (1987) gave integral expressions for the posterior distributions of the
difference of proportions, relative risk, and odds ratio.

To  obtain  good  frequentist  performance  in  terms  of  maintaining  coverage  probability 
close  to  the  nominal  level  over  the  entire  parameter  space,  it  is  best  to  use  quite  diffuse  priors. 
Even  uniform  priors  are  often  too  informative.  Agresti  and  Min  (2005a)  recommended,  in 
agreement with Brown et al. (2001) in the single binomial case, using independent Jeffreys
priors for $\pi_1$ and $\pi_2$.


3.6.4  Example:  Urn  Sampling  Gives  Highly  Unbalanced  Treatment  Allocation 

For  small  samples,  results  can  depend  strongly  on  the  choice  of  prior  distribution,  as  shown 
by  an  example  from  a  clinical  trial  discussed  by  Begg  (1990).  For  an  urn  sampling  method 
to allocate patients to treatments, the 11 patients allocated to the experimental treatment
were  all  successes  and  the  only  patient  allocated  to  the  control  treatment  was  a  failure.  That 
is,  the  table  has  rows  (11,0)  and  (0,  1). 

The 95% equal-tail posterior interval for the odds ratio is (1.2, 218.4) for independent
beta(2, 2) priors, (1.7, 4677) for uniform (beta(1, 1)) priors, and (3.3, $1.4 \times 10^6$) for
Jeffreys (beta(0.5, 0.5)) priors. By contrast, the frequentist 95% confidence interval based
on inverting the large-sample score test is (4.5, $\infty$). Incorporating prior beliefs with a mean
of no effect causes the lower bound for the odds ratio to be pulled considerably toward the
no effect value of 1.0.

With uniform priors, the posterior densities are beta(12, 1) for $\pi_1$ and beta(1, 2) for $\pi_2$.
A simple way to estimate precisely the posterior $P(\pi_1 > \pi_2 \mid y_1, n_1; y_2, n_2)$ is to generate a
huge number of beta random variables from these two densities and observe the proportion
of cases for which $\pi_1 > \pi_2$. We then find that $P(\pi_1 > \pi_2 \mid y_1, n_1; y_2, n_2) = 0.99$. There is
strong evidence that the experimental treatment is better than control.
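
As a rough sketch of that simulation, the following Python code draws from the beta(12, 1) and beta(1, 2) posteriors and reports the proportion of draws with $\pi_1 > \pi_2$. The function name `prob_greater`, the number of draws, and the seed are illustrative assumptions. For this particular pair of posteriors a closed form is available as a check: the beta(1, 2) cdf is $2x - x^2$, so the probability equals $E(2\pi_1 - \pi_1^2) = 24/13 - 12/14 = 90/91 \approx 0.989$.

```python
import random

def prob_greater(a1, b1, a2, b2, draws=200_000, seed=2013):
    """Monte Carlo estimate of P(pi1 > pi2) for independent
    pi1 ~ beta(a1, b1) and pi2 ~ beta(a2, b2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        if rng.betavariate(a1, b1) > rng.betavariate(a2, b2):
            hits += 1
    return hits / draws

# Posteriors for the urn-sampling example with uniform priors.
estimate = prob_greater(12, 1, 1, 2)

# Closed-form check: E[2*pi1 - pi1**2] with E[pi1] = 12/13 and E[pi1**2] = 12/14.
exact = 2 * 12 / 13 - 12 / 14

print(round(estimate, 3), round(exact, 3))
```

The same function applies to other priors by changing the beta parameters, for example `prob_greater(11.5, 0.5, 0.5, 1.5)` for Jeffreys priors.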


3.6.5  Highest  Posterior  Density  Intervals 

An alternative approach to constructing posterior intervals uses highest posterior density
(HPD) intervals. When applied to the odds ratio and relative risk, this method has a serious
disadvantage: It is not invariant under nonlinear parameter transformation. Specifically,
suppose (L, U) is a 95% HPD interval based on the posterior distribution of the odds ratio
$\theta$. Then, the 95% HPD interval based on the posterior distribution of $1/\theta$, which is relevant
if we reverse the labeling of the two groups being compared, is not (1/U, 1/L). In fact, it
can be considerably different. This happens because the 95% region of highest density for
a random variable X is not the inverse mapping of the 95% region of highest density for
1/X.

To illustrate, consider uniform prior densities for $\pi_1$ and $\pi_2$ when $n_1 = n_2 = 10$. When
$y_1 = 1$ and $y_2 = 5$, $\hat{\theta} = 1/9$ and the Bayes 95% HPD interval for $\theta$ is (0.0006, 0.82); when
$y_1 = 5$ and $y_2 = 1$, $\hat{\theta} = 9$ and the Bayes 95% HPD interval is (0.17, 38.23), which is
very different from (1/0.82, 1/0.0006). By contrast, the 95% equal-tail confidence intervals
for the odds ratios with uniform priors are (0.017, 1.10) when $y_1 = 1$ and $y_2 = 5$ and
(0.91 = 1/1.10, 57.9 = 1/0.017) when $y_1 = 5$ and $y_2 = 1$. In another example, for tables
with $\hat{\theta} = 0$, the HPD interval with diffuse priors is typically of the form (0, U), but when
rows are interchanged so that $\hat{\theta} = \infty$, the HPD interval has a finite upper bound (Agresti
and Min 2005a).

⁷See www.stat.ufl.edu/~aa/cda/R/bayes/index.html for R functions to construct posterior
intervals for the odds ratio, relative risk, and difference of proportions.

HPD  invariance  to  group  labeling  does  occur  on  the  log  scale  for  the  odds  ratio  and 
relative  risk,  because  the  relevant  parameter  is  a  difference  (e.g.,  log  odds  ratio  =  difference 
of  log  odds)  and  so  is  linearly  rather  than  nonlinearly  transformed  by  a  relabeling  of  the 
groups.  However,  users  interpret  the  magnitude  of  the  odds  ratio  on  its  original  scale  rather 
than  the  log  scale.  So,  the  lack  of  invariance  when  constructing  HPD  intervals  on  the 
original  scale  is  to  us  a  compelling  reason  not  to  use  the  HPD  approach  for  the  odds  ratio 
or  relative  risk. 
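
The lack of invariance can be seen numerically with a short simulation. The sketch below is an approximation under stated assumptions, not the method behind the intervals quoted above: it takes the case $n_1 = n_2 = 10$, $y_1 = 1$, $y_2 = 5$ with uniform priors, and approximates the HPD interval by the shortest interval containing 95% of the simulated draws (which matches the HPD interval only for a unimodal posterior). The helper names `shortest_interval` and `equal_tail` are hypothetical.

```python
import random

def shortest_interval(sorted_values, level=0.95):
    """Shortest interval covering `level` of the sorted draws;
    approximates the HPD interval for a unimodal posterior."""
    n = len(sorted_values)
    k = int(level * n)
    best = min(range(n - k),
               key=lambda i: sorted_values[i + k] - sorted_values[i])
    return sorted_values[best], sorted_values[best + k]

def equal_tail(sorted_values, level=0.95):
    """Interval between the simulated lower and upper tail percentiles."""
    n = len(sorted_values)
    tail = (1 - level) / 2
    return sorted_values[int(tail * n)], sorted_values[int((1 - tail) * n) - 1]

rng = random.Random(7)
theta = []
for _ in range(200_000):
    p1 = rng.betavariate(2, 10)  # posterior for y1 = 1, n1 = 10, uniform prior
    p2 = rng.betavariate(6, 6)   # posterior for y2 = 5, n2 = 10, uniform prior
    theta.append((p1 / (1 - p1)) / (p2 / (1 - p2)))
theta.sort()
inverse = sorted(1 / t for t in theta)  # draws of 1/theta (groups relabeled)

hpd_theta = shortest_interval(theta)
hpd_inverse = shortest_interval(inverse)
et_theta = equal_tail(theta)
et_inverse = equal_tail(inverse)

print("HPD for theta:          ", hpd_theta)
print("HPD for 1/theta:        ", hpd_inverse)
print("reciprocal of theta HPD:", (1 / hpd_theta[1], 1 / hpd_theta[0]))
print("equal-tail for theta:   ", et_theta)
print("equal-tail for 1/theta: ", et_inverse)
```

In this sketch the equal-tail endpoints for $1/\theta$ are, up to simulation indexing, the reciprocals of those for $\theta$, whereas the approximate HPD interval for $1/\theta$ has an upper bound far below the reciprocal of the HPD lower bound for $\theta$.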

An  exception  when  the  HPD  interval  seems  sensible  is  when  the  posterior  density  is 
monotone.  Then,  excluding  both  upper  and  lower  tails  of  that  distribution  with  the  equal- 
tail  method  seems  inappropriate.  For  example,  suppose  the  sample  odds  ratio  is  0  and  the 
HPD  interval  has  form  (0,  U),  with  the  two  binomials  relabeled  if  necessary  so  this  is  the 
case.  The  HPD  interval  then  seems  more  relevant  than  the  equal-tail  interval.  However, 
then  it  seems  sensible  when  groups  or  outcome  categories  are  interchanged  to  use  the 
corresponding posterior interval (1/U, $\infty$), which is not HPD.

Consider the difference of proportions. When $\hat{\pi}_1 - \hat{\pi}_2$ takes its boundary values of +1
or -1, the posterior density is monotone with the Jeffreys prior or more diffuse priors, and
close to monotone for priors that are more informative than the Jeffreys prior (Agresti and
Min 2005a). So, the HPD interval for $\pi_1 - \pi_2$ seems sensible. With the Jeffreys prior, the
HPD interval then has the form (L, 1) or (-1, U).

3.6.6  Testing  Independence 

For 2 × 2 tables, Bayesian inference about whether two binary variables are independent can
be based directly on posterior tail probabilities and intervals for association parameters, such
as we illustrated in Section 3.6.4. For $I \times J$ tables, such inference is not as straightforward,
because independence relates to $(I - 1)(J - 1)$ parameters instead of a single parameter.

One approach for $I \times J$ tables forms a Bayes factor that is a ratio comparing the
probability of the data under (1) $H_0$: independence and (2) $H_a$: association (see Note 3.11).
Converting this Bayes factor to the posterior probability that $H_0$ is true requires choosing
a prior probability that $H_0$ is true. Naturally, the posterior probability is highly dependent
on  the  choice  of  this  prior  probability.  Gunel  and  Dickey  (1974)  considered  independence 
in  two-way  contingency  tables  under  the  usual  sampling  models.  Conjugate  gamma  priors 
for  the  Poisson  model  induce  priors  in  each  further  conditioned  model.  They  showed  that 
the  Bayes  factor  for  independence  itself  factors,  highlighting  the  evidence  residing  in  the 
marginal  totals. 

Ultimately,  it  is  more  informative  to  focus  on  estimating  parameters  that  describe  the 
association.  With  two  ordinal  response  variables,  we  could  summarize  the  evidence  about 
a positive or negative association through a measure such as the correlation or
gamma.  For  example,  using  the  approach  of  Section  1.6.3  of  combining  a  Dirichlet  prior 
distribution  for  cell  probabilities  with  a  multinomial  likelihood  function  yields  a  Dirichlet 
posterior  distribution  for  the  cell  probabilities.  This  induces  a  posterior  distribution  for  the 
ordinal  measure  of  interest,  yielding  a  posterior  interval  and  posterior  probabilities  of  a 
positive  association  and  of  a  negative  association. 
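
A minimal sketch of this approach for the ordinal measure gamma follows. The 3 × 3 table of counts is hypothetical, the symmetric Dirichlet prior with all hyperparameters equal to 1 is an assumption chosen for illustration, and the function names are invented for the example. Dirichlet draws are generated by normalizing independent gamma variates, which needs only the standard library.

```python
import random

def gamma_measure(p):
    """Goodman-Kruskal gamma for a table of cell probabilities p[i][j]."""
    rows, cols = len(p), len(p[0])
    conc = disc = 0.0
    for i in range(rows):
        for j in range(cols):
            for h in range(i + 1, rows):
                for k in range(cols):
                    if k > j:
                        conc += p[i][j] * p[h][k]   # concordant pair of cells
                    elif k < j:
                        disc += p[i][j] * p[h][k]   # discordant pair of cells
    return (conc - disc) / (conc + disc)

def posterior_gamma_draws(counts, prior=1.0, draws=20_000, seed=3):
    """Draws of gamma from the Dirichlet posterior with parameters
    counts + prior (multinomial likelihood, symmetric Dirichlet prior)."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        g = [[rng.gammavariate(n + prior, 1.0) for n in row] for row in counts]
        total = sum(sum(row) for row in g)
        out.append(gamma_measure([[x / total for x in row] for row in g]))
    return out

counts = [[20, 10, 5], [10, 15, 10], [5, 10, 20]]  # hypothetical ordinal table
samples = sorted(posterior_gamma_draws(counts))
n = len(samples)
lower, upper = samples[int(0.025 * n)], samples[int(0.975 * n) - 1]
prob_positive = sum(s > 0 for s in samples) / n

print("95% equal-tail posterior interval for gamma:", (lower, upper))
print("posterior P(gamma > 0):", prob_positive)
```

The posterior probability of a negative association is one minus `prob_positive`, since gamma has a continuous posterior distribution here.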

3.6.7 Empirical Bayes and Hierarchical Bayesian Approaches

Some  methodologists  find  it  appealing  to  treat  parameters  as  random  variables  having 
distributions  but  dislike  the  subjectivity  inherent  in  the  Bayesian  approach  from  selecting  a 
prior  distribution.  An  alternative  way  of  implementing  a  Bayesian  approach  is  to  let  the  data 
suggest  hyperparameter  values  for  use  in  the  prior  distribution.  This  is  called  the  empirical 
Bayes  approach.  Most  commonly,  this  approach  uses  the  prior  hyperparameter  values  that 
maximize  the  marginal  probability  of  the  observed  data,  integrating  out  the  parameters 
with  respect  to  that  prior  (e.g.,  Efron  and  Morris  1975).  A  related  approach  estimates 
the  prior  that  has  Bayes  estimator  with  smallest  total  mean  squared  error  (Exercise  3.46). 
I.  J.  Good  seems  to  have  first  used  an  empirical  Bayesian  approach  with  contingency  tables, 
estimating  hyperparameters  in  gamma  and  log-normal  priors  for  association  factors.  Good 
(1965)  used  it  to  estimate  the  hyperparameter  value  for  a  symmetric  Dirichlet  prior  for 
multinomial  parameters. 
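
To make the idea of maximizing the marginal probability concrete, here is a hedged sketch for a setting not worked out in the text above: several binomial counts whose parameters share a common beta(a, b) prior, so that the marginal distribution of each count is beta-binomial. The data, the crude grid search, and all function names are assumptions for illustration only; practical software would use a proper optimizer.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(data, a, b):
    """Log beta-binomial marginal likelihood of (y, n) pairs, omitting
    the binomial coefficients, which do not depend on (a, b)."""
    return sum(log_beta(y + a, n - y + b) - log_beta(a, b) for y, n in data)

def empirical_bayes_beta(data, grid):
    """Grid search for the hyperparameters maximizing the marginal likelihood."""
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: log_marginal(data, ab[0], ab[1]))

data = [(3, 20), (5, 20), (2, 20), (8, 20), (4, 20)]  # hypothetical (successes, trials)
grid = [0.5 * k for k in range(1, 81)]                # candidate values 0.5, 1.0, ..., 40.0
a_hat, b_hat = empirical_bayes_beta(data, grid)

# Empirical Bayes estimates: posterior means with the estimated prior plugged in.
shrunk = [(y + a_hat) / (n + a_hat + b_hat) for y, n in data]
print((a_hat, b_hat), [round(s, 3) for s in shrunk])
```

Each plug-in posterior mean lies between the sample proportion y/n and the estimated prior mean a/(a + b), which is the shrinkage behavior typical of empirical Bayes estimates.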

A  disadvantage  of  the  empirical  Bayesian  approach  is  not  accounting  for  the  source  of 
variability  due  to  substituting  estimates  for  prior  hyperparameters.  An  alternative  approach 
not  having  this  disadvantage  is  hierarchical  Bayes,  which  lets  the  prior  hyperparameters 
themselves  have  a  second-stage  prior  distribution.  For  multinomial  data,  for  example,  Good 
(1965, 1976) noted that Dirichlet priors do not always provide sufficient flexibility. He
proposed a hierarchical approach of specifying distributions for the Dirichlet hyperparameters,
treating $\{\alpha_i\}$ in the Dirichlet prior as unknown and specifying a second-stage prior for them.
This  approach  gains  greater  generality  at  the  expense  of  giving  up  the  simple  conjugate 
Dirichlet  form  for  the  posterior. 

Most of the empirical Bayes and hierarchical Bayes literature refers to estimating
multiple parameters, such as several binomial parameters $\{\pi_i\}$. For instance, at stage 1, given
$\mu$ and $\sigma$, we might assume that $\{\text{logit}(\pi_i)\}$ are independent from a $N(\mu, \sigma^2)$ distribution,
and at stage 2 assume a highly disperse normal prior for $\mu$ and an inverse chi-squared
prior distribution for $\sigma^2$. Leonard (1972) proposed an approach of this type, for which the
posterior mean estimate of logit($\pi_i$) is approximately a weighted average of logit($p_i$) and
$\{\text{logit}(p_j),\ j \ne i\}$.


3.7  EXTENSIONS  FOR  MULTIWAY  TABLES  AND 
NONTABULATED RESPONSES

The  methods  of  this  chapter  extend  to  multiway  contingency  tables.  For  instance,  tests  of 
independence  for  two-way  tables  extend  to  tests  of  conditional  independence  in  three-way 
tables.  In  future  chapters  we  present  such  methods  with  models  that  provide  a  basis  for 
defining  relevant  parameters  and  their  statistical  inferences. 

3.7.1  Categorical  Data  Need  Not  Be  Contingency  Tables 

Examples so far have presented categorical data in the format of contingency tables.
However, this book has broader focus than contingency table analysis. Models for categorical
response variables can have continuous as well as discrete explanatory variables. Even
when all or most variables are categorical, source data files are not usually contingency
tables but have the form of a line of data for each subject. The first three lines in a data file
containing responses of a survey of subjects measuring gender, race, education (1 = less
than high school, 2 = high school or some college, 3 = college graduate), and attitude
toward homosexuality (1 = tolerant, 2 = homophobic) might be:


Subject   Gender   Race   Education   Attitude
1         f        w      2           1
2         m        b      3           1
3         m        w      1           2

Software can read data files of this type and then conduct analyses that may or may not
involve  forming  contingency  tables. 
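
As a small illustration of forming a contingency table from subject-level lines, the Python sketch below cross-classifies the three records shown above. The tuple layout and the helper name `cross_classify` are assumptions for the example; real data files would be read from disk rather than typed in.

```python
from collections import Counter

# One tuple per subject: (subject, gender, race, education, attitude).
records = [
    (1, "f", "w", 2, 1),
    (2, "m", "b", 3, 1),
    (3, "m", "w", 1, 2),
]

def cross_classify(rows, row_index, col_index):
    """Count subjects in each (row category, column category) cell."""
    return Counter((r[row_index], r[col_index]) for r in rows)

# Gender (position 1) by attitude (position 4).
table = cross_classify(records, 1, 4)
for gender in ("f", "m"):
    print(gender, [table[(gender, attitude)] for attitude in (1, 2)])
```

A `Counter` returns 0 for cells with no subjects, so empty cells of the table need no special handling.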

In  the  next  chapter  we  introduce  the  modeling  framework  used  in  the  rest  of  the  book. 
All  the  methods  that  we’ve  studied  in  this  chapter  result  from  inferences  for  parameters  in 
simple  versions  of  these  models. 


NOTES 

Section  3.1:  Confidence  Intervals  for  Association  Parameters 

3.1 Standard errors: Goodman and Kruskal (1963, 1972) provided standard errors for many
association measures and extended (3.9) for independent multinomial sampling. For
adaptations of the Wald interval (3.2) for $\log \theta$ that better handle zero cell counts, see Agresti
(1999) and Gart (1971). Agresti and Caffo (2000) showed that as in the single-sample case
(Exercise 1.25), the Wald interval (3.4) for $\pi_1 - \pi_2$ behaves much better after adding two
pseudo-observations of each type (one of each type in each sample). Fagerland et al. (2012)
compared various confidence interval methods for the difference of proportions, odds ratio,
and relative risk.

3.2  Multiple  comparisons:  Agresti  et  al.  (2008)  proposed  a  method  for  multiple  comparisons 
using  effect  measures  for  comparing  proportions  for  g  groups  that  is  an  analog  of  Tukey’s 
method  for  normal  means.  It  is  based  on  applying  the  Studentized  range  distribution  with 
df = $\infty$ to a set of approximately standard normally distributed score statistics constructed for
the  pairs  of  groups.  Schaarschmidt  et  al.  (2008)  proposed  simultaneous  confidence  intervals 
for  multiple  contrasts  of  binomial  proportions.  For  discrete  small-sample  distributions,  Tarone 
(1990)  adjusted  the  Bonferroni  method  to  reduce  its  conservatism. 


Section  3.2:  Testing  Independence  in  Two-Way  Contingency  Tables 

3.3 Chi-squared moments/approximations: For hypergeometric sampling for $I \times J$ tables,
Haldane (1940) derived $E(X^2)$ and a complex formula for var($X^2$); Dawson (1954) provided
a simplified expression. Lewis et al. (1984) derived the third central moment. For 2 × 2
tables, Pearson (1947) and others since then (e.g., Campbell 2007) suggested using the
multiple $(n - 1)/n$ of the chi-squared statistic. For discussion of the adequacy of chi-squared
approximations, see Cressie and Read (1989), Read and Cressie (1988) and references therein,
Koehler (1986), Koehler and Larntz (1980), Larntz (1978), and Maiste and Weir (2004).
Diaconis and Efron (1985) presented inference based on a uniform distribution over all
possible tables of the same I, J, and n; their volume test considers the proportion of such
tables having $X^2 \le X_o^2$, for observed value $X_o^2$.

3.4 Complex sampling: Social science applications often incorporate clustering and/or
stratification. For analyses of categorical data for complex sampling methods and correlated
observations, see Bedrick (1983), Cerioli (2002), Fay (1985), Gleser and Moore (1985), Holt
et al. (1980), Koch et al. (1975), Koehler and Wilson (1986), LaVange et al. (2001), Rao
and Scott (1981, 1987), Rao and Thomas (1988), Scott and Wild (2001), Skinner and Vallet
(2010), Tavare and Altham (1983), and methods of Chapter 13. For example, Gleser and
Moore (1985) showed that positive dependence causes null distributions of Pearson statistics
to stochastically increase.

3.5  Missing  data:  Watson  (1956)  was  an  early  study  of  effects  of  missing  data  in  contingency 
tables.  Lipsitz  and  Fitzmaurice  (1996)  derived  score  tests  of  independence  and  conditional 
independence,  assuming  ignorable  nonresponse,  and  showed  that  the  test  statistics  have  the 
usual  asymptotic  chi-squared  null  distributions.  Fleiss  et  al.  (2003,  Chap.  16),  Little  (2005), 
and  Little  and  Rubin  (2002)  surveyed  ways  of  dealing  with  missing  data. 

3.6 Score and profile likelihood CIs: For other discussion of score test-based intervals, see
Agresti (2011) and references therein, Agresti and Ryu (2010), Brown and Li (2005), Gart
and Nam (1988), Koopman (1984), Lang (2008), Miettinen and Nurminen (1985), Nurminen
(1986), and Exercise 3.27. Cornfield’s (1956) interval for the odds ratio utilized a
continuity correction. That interval approximates a small-sample interval presented in Section
16.6.4. The Miettinen and Nurminen (1985) score intervals used unbiased variance
estimators. For example, their nonnull chi-squared statistic for the difference of proportions has form
$[(n - 1)/n][z(\Delta_0)]^2$, so their interval is slightly wider. Cox and Snell (1989, pp. 51-52)
presented the profile likelihood interval for the difference of proportions.


Section  3.4:  Two-Way  Tables  with  Ordered  Classifications 

3.7 Ordinality: Brown and Benedetti (1977) provided null standard errors of ordinal measures
appropriate for testing independence. Bhapkar (1968) and Yates (1948) proposed statistics
similar to $M^2$ and statistics for singly ordered tables. Graubard and Korn (1987) listed 14
tests for 2 × J tables that utilize a correlation-type statistic. See also Nair (1987) and Williams
(1952). Cohen and Sackrowitz (1992) evaluated decision-theoretic aspects, such as
admissibility, of tests based on gamma and local log odds ratios.


Section  3.5:  Small-Sample  Inference  for  Contingency  Tables 

3.8 Continuity correction: For early discussion of Fisher’s exact test, see Fisher (1934, 1935a,c),
Irwin (1935), and Yates (1934). Yates indicated that Fisher suggested the hypergeometric
distribution to him for an exact test. He proposed a continuity-corrected version of $X^2$ for
2 × 2 tables,

$$X_c^2 = \sum_i \sum_j \frac{(|n_{ij} - \hat{\mu}_{ij}| - 0.5)^2}{\hat{\mu}_{ij}},$$

so that the chi-squared right-tail probability would better approximate the hypergeometric
two-sided P-value from Fisher’s exact test. Hitchcock (2009) surveyed arguments for and
against its use by Yates and other authors. Since software now makes Fisher’s exact test
feasible even with large samples, this correction is no longer needed.

3.9 Conditional/unconditional: For exact conditional methods, Diaconis and Sturmfels (1998)
and Rapallo (2003) proposed algebraic Markov chain algorithms for sampling from the
relevant conditional distributions. The controversy over conditioning includes Barnard (1945,
1947, 1949, 1979), Berkson (1978), Cheng et al. (2008), Fisher (1956), Howard (1998),
Kempthorne (1979), Little (1989), Lloyd (1988a), Pearson (1947), Rice (1988), Routledge
(1992), Suissa and Shuster (1984, 1985), and Yates (1934, 1984). Discussion of unconditional
methods includes Agresti and Min (2001), Berger and Boos (1994), and Lin and Yang (2009).
Martin Andres and Silva Mato (1994) summarized and compared various unconditional tests.
They found that the method based on the pooled z statistic may not perform as well as
Barnard’s or Boschloo’s test when the sample sizes are very unbalanced. Chan (1998) and
Rohmel and Mansmann (1999) considered unconditional tests of equivalence. Zhu and Reid
(1994) noted that some information loss about the association occurs in conditioning on the
margins except when $\theta = 1$. Other articles on this topic include Berkson (1978), Crook and
Good (1980), Gunel and Dickey (1974), Haber (1989), Plackett (1977), and Yates (1984).
Agresti (1992, 2001) surveyed small-sample methods.

3.10 Two-sided P-value, mid P-value: For discussion of two-sided P-values for Fisher’s test, see
Blaker (2000), Davis (1986a), Dupont (1986), Lloyd (1988b), Mantel (1987b), and Yates and
discussants (1984). For inference using the mid P-value, see Agresti and Gottard (2007) and
references therein, Berry and Armitage (1995), Hirji (2005, Sec. 2.5, 2.8, and Sec. 2.11.1 for
many references), Hwang and Yang (2001), Routledge (1994), Seneta and Phipps (2001),
Seneta et al. (1999), Wells (2010), and Yang et al. (2004). Similar benefits can accrue from
alternative proposed P-values. One approach, useful when several tables have the same value
for a test statistic, uses the table probability to create a more finely partitioned sample space; for
tables having the observed test statistic value, only those contribute to the P-value that are no
more likely than the observed table (Cohen and Sackrowitz 1992, Kim and Agresti 1995). This
depends on more than the sufficient statistic, and in some cases a Rao-Blackwellized version
is the mid P-value (Wells 2010). Ordinary P-values obtained with higher-order asymptotic
methods without continuity corrections for discreteness yield performance similar to that of
the mid P-value (Brazzale et al. 2007, Pierce and Peters 1999, Strawderman and Wells 1998).


Section  3.6:  Bayesian  Inference  for  Two-Way  Contingency  Tables 

3.11  Bayes:  Agresti  and  Hitchcock  (2005)  gave  many  other  references  for  Bayesian  inference  in 
2 × 2 tables. Bayes factors for testing independence were considered by Albert (1997), Casella
and  Moreno  (2009),  Crook  and  Good  (1980),  Good  (1976),  and  Quintana  (1998).  Altham 
(1969)  used  a  Bayesian  analysis  with  two  ordinal  multinomial  distributions  that  evaluates  the 
extent  of  evidence  about  stochastic  ordering.  For  situations  with  no  prior  information  and 
even  an  unknown  sample  space,  Walley  (1996)  proposed  an  “imprecise  Dirichlet  model”  for 
multinomial  data  for  which  inferences  are  expressed  in  terms  of  posterior  upper  and  lower 
probabilities  that  become  more  precise  as  the  number  of  observations  increases. 


EXERCISES 

Applications 

3.1  A  meta-analysis  (Moore  et  al.,  Lancet  370:  319-328,  2007)  of  studies  on  the  asso¬ 
ciation  between  cannabis  use  (yes,  no)  and  presence  of  psychosis  (yes,  no)  reported 
a  pooled  odds  ratio  estimate  of  1.41,  with  95%  confidence  interval  of  (1.20,  1.65). 
Explain  how  to  interpret  this  interval. 

3.2 For 239 golf tournaments on the PGA tour between 2004 and 2009, the economists
D. Pope and M. Schweitzer evaluated risk aversion by comparing percentages of
putts made when putting for a par versus putting for a birdie (Am. Econ. Rev. 101:
129-157, 2011). For 2828 pairs of putts taken from within 1 inch of each other
(from an average distance of about 50 inches) in the same tournament, the sample
proportions made were 0.835 when putting for birdie and 0.880 when putting for
par (thus avoiding the loss of a bogey). Construct a 95% confidence interval for the
difference between the proportions in a corresponding conceptual population. State
assumptions, and indicate a key way they do not apply for this study. (Chapter 11
presents more refined methods.)

3.3  Table  3.10  uses  the  GSS  to  cross-classify  a  subject’s  political  party  ID  with  their 
opinion  about  whether  homosexuals  should  have  the  right  to  marry,  for  subjects  hav¬ 
ing  strong  identification  with  a  particular  party  and  strong  agreement  or  disagreement 
with homosexual marriage. Show that (a) $\log(\hat{\theta}) = 3.728$, (b) its standard error is
0.746, and (c) the Wald 95% confidence interval for $\theta$ is (9.6, 179.3). Name the main
factor  that  causes  this  interval  estimate  to  be  so  imprecise. 


Table 3.10 Opinion on Homosexual Marriage by Political Party,
for Exercise 3.3

                      Homosexuals Should Have Right to Marry
Political Party       Strongly Agree      Strongly Disagree
Strong Democrat             60                   44
Strong Republican            2                   61

Source: 2010 General Social Survey.


3.4  For  Table  2.10  on  seat-belt  use  and  results  of  auto  accidents,  find  and  interpret  95% 
confidence  intervals  for  the  conceptual  population  (a)  odds  ratio,  (b)  difference  of 
proportions,  and  (c)  relative  risk. 

3.5  Refer  to  Table  2.5  on  lung  cancer  and  smoking.  Conduct  an  inferential  analysis,  and 
interpret  results. 

3.6  A  study  considered  the  effect  of  prednisolone  on  severe  hypercalcemia  in  women 
with metastatic breast cancer (B. Kristensen et al., J. Intern. Med. 232: 237-245,
1992). Of 30 patients, 15 were randomly selected to receive prednisolone. The other
15  formed  a  control  group.  Normalization  in  their  level  of  serum-ionized  calcium 
was  achieved  by  7  of  the  treated  patients  and  none  of  the  control  group.  Obtain  a 
95%  confidence  interval  for  the  odds  ratio  using  (a)  the  Wald  interval  and  (b)  the 
profile  likelihood.  In  each  case,  note  the  effect  of  the  zero  cell  count. 

3.7  In  professional  basketball  games  during  2009-2010,  when  Kobe  Bryant  of  the  Los 
Angeles  Lakers  shot  a  pair  of  free  throws,  8  times  he  missed  both,  152  times 
he  made  both,  33  times  he  made  only  the  first,  and  37  times  he  made  only  the 
second.  Is  it  plausible  that  the  successive  free  throws  are  independent?  (Source  of 
data: www.nba.com and appendix of article e24532 (vol. 6, 2011) by G. Yaari and
S. Eisenmann at www.plosone.org investigating the “hot hand” in sports.)

3.8  Refer  to  Exercise  3.3  and  Table  3.10. 

a.  Find  the  z  statistic  (3.12)  and  explain  how  it  relates  to  a  chi-squared  test. 

b. Find a score or profile likelihood confidence interval for the odds ratio, and
compare it to the Wald interval.

3.9 Go to sda.berkeley.edu/GSS and download a contingency table relating attained
education and the fundamentalism of one’s religious beliefs, for the most recent
survey. The GSS variable names are EDUCATION and FUND, and you can enter
the year in the Selection Filter, such as YEAR(2010). Using the GSS capabilities or
software, conduct the following analyses:

a.  Report  chi-squared  statistics,  df  values,  P-values,  and  interpret. 

b. Conduct a residual analysis, and interpret. (The standardized residuals are
generated by the GSS if you check “Show z statistic” on their menu.)

3.10  As  in  the  previous  exercise,  download  recent  GSS  data  and  perform  analyses  to 
answer  the  questions  asked. 

a.  Are  people  happier  who  believe  in  life  after  death?  Analyze  using  the  GSS 
variables  HAPPY  and  POSTLIFE. 

b. Is belief in the existence of God associated with party ID? Analyze the 3 × 6
table  resulting  from  using  the  GSS  variables  GOD  and  PARTYID,  combining  the 
PARTYID  categories  0  and  1  for  Democrat,  2,  3,  and  4  for  Independent,  and  5 
and  6  for  Republican. 

3.11  Refer  to  Table  3.11,  GSS  data  on  party  ID  and  race. 

a. Using $X^2$ and $G^2$, test the hypothesis of independence between party
identification and race. Report the P-values and interpret.

b.  Use  standardized  residuals  to  describe  the  evidence  of  association. 

c.  Partition  chi-squared  into  components  regarding  the  choice  between  Democrat 
and  Independent  and  between  these  two  combined  and  Republican.  Interpret. 


Table 3.11 Data for Exercise 3.11 on Party ID and Race

                  Party Identification
Race      Democrat    Independent    Republican
Black        192           75             8
White        459          586           471

Source: 2008 General Social Survey, National Opinion Research Center.


3.12  Using  the  2008  GSS,  we  cross-classified  party  ID  with  gender.  Table  3.12  shows 
some  results.  Explain  how  to  interpret  all  the  results  on  this  printout.  (Reschi  denotes 
the  Pearson  residual  and  StReschi  denotes  the  standardized  residual.) 

3.13 A recent study (by R. Armenio et al., J. Am. Dent. Assoc. 139: 592-597, 2008)
reported results of a double-blind randomized clinical trial comparing tooth sensitivity
for 14 patients using a fluoride gel to 15 patients using placebo. Each patient had
weekly visits for responses, between 3 and 7 times. The authors reported a 2 x 2
table having counts (11, 57) for placebo and (21, 62) for fluoride gel for the (yes, no)
response  on  tooth  sensitivity.  They  reported  a  P-value  of  0.2  for  a  chi-squared  test 
comparing  the  two  treatments.  Discuss  the  suitability  of  this  analysis.  [Hint:  Are  the 
observations  independent?] 

Table 3.12 Results for Exercise 3.12 on Party ID and Gender

Frequency
Expected        dem      indep      repub
female          422        381        273
             393.41     407.05     275.55
male            299        365        232
             327.59     338.95     229.45

Statistic                       DF      Value      Prob
Chi-Square                       2     8.2943    0.0158
Likelihood Ratio Chi-Square      2     8.3090    0.0157

Observ   Resraw   Reschi   StReschi     Observ   Resraw   Reschi   StReschi
   1      28.59     1.44       2.69        4     -28.59    -1.58      -2.69
   2     -26.05    -1.29      -2.43        5      26.05     1.41       2.43
   3      -2.55    -0.15      -0.26        6       2.55     0.17       0.26
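
The quantities on a printout like Table 3.12 can be recomputed from the observed counts alone. The following is a minimal Python sketch, not the SAS program that produced the table. It assumes the usual definitions: expected frequency $n_{i+}n_{+j}/n$, Pearson residual (observed minus expected, divided by the square root of the expected frequency), and standardized residual with the additional factor $(1 - p_{i+})(1 - p_{+j})$ under the square root. The function name `residuals` is our own.

```python
import math

# Observed counts from Table 3.12 (rows: female, male; columns: dem, indep, repub).
observed = [[422, 381, 273],
            [299, 365, 232]]

n = sum(map(sum, observed))
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]

def residuals(i, j):
    """Return (raw, Pearson, standardized) residuals for cell (i, j)."""
    mu = row_tot[i] * col_tot[j] / n          # expected frequency under independence
    raw = observed[i][j] - mu
    pearson = raw / math.sqrt(mu)
    standardized = raw / math.sqrt(mu * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
    return raw, pearson, standardized

# The Pearson chi-squared statistic is the sum of the squared Pearson residuals.
x2 = sum(residuals(i, j)[1] ** 2 for i in range(2) for j in range(3))

for i, sex in enumerate(["female", "male"]):
    for j, party in enumerate(["dem", "indep", "repub"]):
        raw, pear, std = residuals(i, j)
        print(f"{sex:6s} {party:5s} {raw:8.2f} {pear:6.2f} {std:6.2f}")
print(f"X^2 = {x2:.4f}")
```

Within each column the two standardized residuals have the same absolute value, which is one reason they are easier to read than the Pearson residuals.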

Table 3.13 Data for Exercise 3.14 on Psychiatric Diagnoses

Diagnosis                Drugs    No Drugs
Schizophrenia             105         8
Affective disorder         12         2
Neurosis                   18        19
Personality disorder       47        52
Special symptoms            0        13

Source: Reprinted with permission from E. Helmes and G. C. Fekken,
J. Clin. Psychol. 42: 569-576, 1986.


3.14  Table  3.13  classifies  a  sample  of  psychiatric  patients  by  their  diagnosis  and  by 
whether their treatment prescribed drugs. Partition chi-squared into three components
to describe differences and similarities among the diagnoses, by comparing
(i)  the  first  two  rows,  (ii)  the  third  and  fourth  rows,  and  (iii)  the  last  row  to  the  first 
and  second  rows  combined  and  the  third  and  fourth  rows  combined. 

3.15 A GSS that cross-classified income in thousands of dollars (<5, 5-15, 15-25, >25)
by job satisfaction (very dissatisfied, a little satisfied, moderately satisfied, very
satisfied) for black Americans produced a 4 x 4 table having counts, by row, (2, 4,
13, 3 / 2, 6, 22, 4 / 0, 1, 15, 8 / 0, 3, 13, 8). For this table, $X^2 = 11.5$ ($P = 0.24$),
whereas using scores (3, 10, 20, 35) for income and (1, 3, 4, 5) for job satisfaction,
$M^2 = 7.04$ ($P = 0.008$). Explain why the results are so different.
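
The two statistics quoted in Exercise 3.15 can be reproduced with a short Python sketch. It assumes the linear trend statistic has the form $M^2 = (n - 1)r^2$, where $r$ is the sample correlation between the row and column scores weighted by the cell counts. The variable names are our own.

```python
import math

# Counts from Exercise 3.15 (rows: income; columns: job satisfaction).
counts = [[2, 4, 13, 3],
          [2, 6, 22, 4],
          [0, 1, 15, 8],
          [0, 3, 13, 8]]
u = [3, 10, 20, 35]   # row (income) scores
v = [1, 3, 4, 5]      # column (job satisfaction) scores

n = sum(map(sum, counts))
row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]

# Pearson X^2 treats both classifications as nominal: df = (4 - 1)(4 - 1) = 9.
x2 = sum((counts[i][j] - row_tot[i] * col_tot[j] / n) ** 2
         / (row_tot[i] * col_tot[j] / n)
         for i in range(4) for j in range(4))

# Sample correlation r between the row and column scores.
u_bar = sum(ui * ri for ui, ri in zip(u, row_tot)) / n
v_bar = sum(vj * cj for vj, cj in zip(v, col_tot)) / n
s_uv = sum(counts[i][j] * (u[i] - u_bar) * (v[j] - v_bar)
           for i in range(4) for j in range(4))
s_uu = sum(ri * (ui - u_bar) ** 2 for ui, ri in zip(u, row_tot))
s_vv = sum(cj * (vj - v_bar) ** 2 for vj, cj in zip(v, col_tot))
r = s_uv / math.sqrt(s_uu * s_vv)

# Linear trend statistic: a single degree of freedom aimed at a monotone trend.
m2 = (n - 1) * r ** 2
print(f"X^2 = {x2:.2f} (df = 9), r = {r:.3f}, M^2 = {m2:.2f} (df = 1)")
```

The sketch makes the contrast concrete: $X^2$ spreads its evidence over 9 degrees of freedom, whereas $M^2$ concentrates on one.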

3.16  A  study  on  educational  aspirations  of  high  school  students  (S.  Crysdale,  Int.  J. 
Compar.  Sociol.  16:  19-36,  1975)  measured  aspirations  with  the  scale  (some  high 
school,  high  school  graduate,  some  college,  college  graduate).  The  student  counts 
in  these  categories  were  (9,  44,  13,  10)  when  family  income  was  low,  (11,  52, 
23,  22)  when  family  income  was  middle,  and  (9,  41,  12,  27)  when  family  income 
was  high. 

a. Test independence of educational aspirations and family income using $X^2$ or $G^2$.
Explain the deficiency of this test for these data.

b.  Find  the  standardized  residuals.  Do  they  suggest  any  association  pattern? 

c.  Conduct  an  alternative  test  that  may  be  more  powerful.  Interpret. 

3.17  Refer  to  Table  2.13  on  homosexual  sex  and  premarital  sex. 

a.  Construct  and  interpret  a  mosaic  plot. 

b.  Obtain  a  95%  confidence  interval  for  gamma.  Interpret  the  association. 

3.18  Table  3.14  shows  the  results  of  a  retrospective  study  comparing  radiation  therapy 
with  surgery  in  treating  cancer  of  the  larynx.  The  response  indicates  whether  the 
cancer  was  controlled  for  at  least  two  years  following  treatment.  Table  3.15  shows 
SAS  output.  Some  R  output  looks  like: 


> fisher.test(matrix(c(21,2,15,3), ncol=2, byrow=TRUE),
    alternative="two.sided")    p-value = 0.6384

> fisher.test(matrix(c(21,2,15,3), ncol=2, byrow=TRUE),
    alternative="greater")      p-value = 0.3808

> fisher.test(matrix(c(21,2,15,3), ncol=2, byrow=TRUE),
    alternative="less")         p-value = 0.8947

a. Report and interpret the P-value for Fisher’s exact test with (i) $H_a$: $\theta > 1$ and
(ii) $H_a$: $\theta \ne 1$. Explain how the P-values are calculated.

b. Find and interpret the mid P-value for $H_a$: $\theta > 1$. Summarize advantages and
disadvantages of this type of P-value.
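
To see where the software P-values for this exercise come from, the hypergeometric probabilities can be enumerated directly. The sketch below is a Python illustration under the usual conventions, not the R or SAS implementation: conditional on both margins, $n_{11}$ is hypergeometric under $H_0$: $\theta = 1$; the two-sided P-value sums the probabilities of tables no more likely than the one observed; and the mid P-value counts only half the probability of the observed table. The variable names are our own.

```python
from math import comb

# Table 3.14: rows (surgery, radiation), columns (controlled, not controlled).
n11, n12, n21, n22 = 21, 2, 15, 3
row1, row2 = n11 + n12, n21 + n22
col1 = n11 + n21
n = row1 + row2

# Conditional on the margins, n11 is hypergeometric under H0: theta = 1.
# Integer numerators are kept so that probabilities can be compared exactly.
lo, hi = max(0, col1 - row2), min(row1, col1)
weight = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)}
total = comb(n, col1)

p_obs = weight[n11] / total
p_right = sum(w for k, w in weight.items() if k >= n11) / total   # Ha: theta > 1
p_left = sum(w for k, w in weight.items() if k <= n11) / total    # Ha: theta < 1
# Two-sided: tables no more probable than the one observed.
p_two = sum(w for w in weight.values() if w <= weight[n11]) / total
# Mid P-value: half the observed-table probability plus the more extreme tail.
p_mid = p_right - 0.5 * p_obs

print(f"table prob = {p_obs:.4f}, right = {p_right:.4f}, left = {p_left:.4f}, "
      f"two-sided = {p_two:.4f}, mid (right) = {p_mid:.4f}")
```

The first four printed values agree with the table probability and the left-sided, right-sided, and two-sided entries reported by the software.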


Table 3.14 Data for Exercise 3.18 on Therapy for Cancer of Larynx

                      Cancer Controlled    Cancer Not Controlled
Surgery                      21                      2
Radiation therapy            15                      3

Source: Reprinted with permission from W. M. Mendenhall, R. R. Million, D. E.
Sharkey, and N. J. Cassisi, Int. J. Radiat. Oncol. Biol. Phys. 10: 357-363, 1984,
© Pergamon Press.


Table 3.15 SAS Output for Exercise 3.18 on
Therapy for Cancer of Larynx

Fisher's Exact Test

Cell (1,1) Frequency (F)          21
Left-sided Pr <= F            0.8947
Right-sided Pr >= F           0.3808
Table Probability (P)         0.2755
Two-sided Pr <= P             0.6384

3.19 A study in the Department of Wildlife Ecology at the University of Florida sampled
wild common carp fish from a wetland in central Chile. One analysis investigated
whether the fish muscle had lead pollutant and whether there was evident malformation
in the fish. Of 25 fish without lead, 7 had malformation. Of 14 with lead, 7 had
malformation. Report and interpret the P-value for Fisher’s exact test for a one-sided
alternative of a greater chance of malformation when there is lead pollution.

3.20  Seneta  and  Phipps  (2001)  described  a  medical  study  that  compared  subjects  with 
nonacute  appendicitis  and  with  acute  appendicitis  in  terms  of  whether  they  suffered 
severe  right  abdominal  pain.  Such  severe  pain  was  reported  by  5  of  the  15  nonacute 
cases  and  by  1  of  the  16  acute  cases.  The  doctors  believed  that  greater  density  of 
nerve  fibres  in  the  nonacute  cases  could  increase  the  chance  of  such  pain.  Find  and 
interpret the P-value for a one-sided (a) Fisher’s exact test and (b) unconditional
exact  test  for  two  binomials. 

3.21  Analyze  Table  3.1  using  the  Bayesian  approach  with  independent  uniform  prior 
distributions. 

a. Specify the posterior distribution of $(\pi_1, \pi_2)$.

b. Using software or your own simulation, estimate the posterior mean of the difference
of proportions and find a 95% equal-tail posterior interval. Interpret.

3.22 Refer to the table (11, 0 / 0, 1) analyzed with Bayesian methods in Section 3.6.4.
Using simulation, estimate $P(\pi_1 > \pi_2 \mid y_1, n_1, y_2, n_2)$ for independent beta($\alpha_1$, $\alpha_2$)
priors having (a) $\alpha_1 = \alpha_2 = 2$, (b) $\alpha_1 = \alpha_2 = 1$, and (c) $\alpha_1 = \alpha_2 = 0.50$. Interpret.
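
One way to set up the simulation is sketched below in Python. It assumes that the table (11, 0 / 0, 1) means 11 successes in 11 trials for the first group and 0 successes in 1 trial for the second, and it uses the conjugate result that a beta($\alpha$, $\alpha$) prior with $y$ successes in $n$ trials gives a beta($\alpha + y$, $\alpha + n - y$) posterior. The function name, the number of draws, and the seed are our own choices.

```python
import random

def prob_pi1_greater(y1, n1, y2, n2, alpha, draws=50_000, seed=1):
    """Monte Carlo estimate of P(pi1 > pi2 | data) under independent beta(alpha, alpha) priors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # Conjugate update: posterior is beta(alpha + successes, alpha + failures).
        pi1 = rng.betavariate(alpha + y1, alpha + n1 - y1)
        pi2 = rng.betavariate(alpha + y2, alpha + n2 - y2)
        hits += pi1 > pi2
    return hits / draws

# Table (11, 0 / 0, 1): 11 successes in 11 trials versus 0 successes in 1 trial.
for alpha in (2, 1, 0.5):
    est = prob_pi1_greater(11, 11, 0, 1, alpha)
    print(f"alpha = {alpha}: estimated P(pi1 > pi2 | data) = {est:.3f}")
```

For the uniform prior the estimate can be checked against the exact value $\int_0^1 12x^{11}(2x - x^2)\,dx = 180/182$, since the posteriors are then beta(12, 1) and beta(1, 2).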

3.23  Table  3.16  cross-classifies  votes  in  the  2000  and  2004  U.S.  presidential  elections. 
Treating  the  two  rows  as  independent  binomials  and  using  uniform  priors,  generate 
the  posterior  distribution  of  the  odds  ratio.  Plot  it,  and  find  a  95%  equal-tail  or  HPD 
posterior  interval.  What  is  the  disadvantage  of  an  HPD  interval  here? 


Table 3.16 Data on Presidential Votes in 2000
and 2004, for Exercise 3.23

                   Vote in 2004
Vote in 2000      Bush     Kerry
Bush               763        65
Gore                59       680

Source: 2006 General Social Survey


Theory  and  Methods 

3.24 Is $\hat\theta$ the midpoint of commonly used confidence intervals for the odds ratio $\theta$? Why
or why not?

3.25 For comparing two binomial samples with fixed sample sizes, show that the standard
error (3.1) of a log odds ratio increases when, for either sample, the absolute
difference of proportions of successes and failures increases. [Hint: Show that the
asymptotic variance is minimized when each binomial probability is 0.50. In particular,
when an outcome is relatively uncommon, estimates of the log odds ratio tend
to be imprecise.]

3.26 Using the delta method as in Section 3.1.6, show that the Wald confidence interval
for the logit of a binomial parameter $\pi$ is

$$\log[\hat\pi/(1 - \hat\pi)] \pm z_{\alpha/2}\big/\sqrt{n\hat\pi(1 - \hat\pi)}.$$

Explain how to use this interval to obtain one for $\pi$ itself. [Newcombe (2001) noted
that the sample logit is also the midpoint of the score interval (1.14) for $\pi$, on the
logit scale. He showed that this logit interval contains the score interval.]
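
The interval in Exercise 3.26 and its back-transformation can be sketched numerically. The Python fragment below uses hypothetical data (12 successes in 40 trials) and a function name of our own. It forms the Wald interval on the logit scale and then applies the inverse logit to each endpoint, which is valid because the logit is monotone increasing.

```python
import math
from statistics import NormalDist

def logit_wald_interval(y, n, conf=0.95):
    """Wald interval for logit(pi), plus its image on the probability scale."""
    p_hat = y / n                      # requires 0 < y < n so the logit is finite
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    logit = math.log(p_hat / (1 - p_hat))
    se = 1 / math.sqrt(n * p_hat * (1 - p_hat))    # delta-method standard error
    lo, hi = logit - z * se, logit + z * se
    expit = lambda t: 1 / (1 + math.exp(-t))       # inverse of the logit
    return (lo, hi), (expit(lo), expit(hi))

# Hypothetical data: 12 successes in 40 trials.
(logit_lo, logit_hi), (pi_lo, pi_hi) = logit_wald_interval(12, 40)
print(f"logit scale: ({logit_lo:.3f}, {logit_hi:.3f})")
print(f"pi scale:    ({pi_lo:.3f}, {pi_hi:.3f})")
```

Unlike the ordinary Wald interval for $\pi$, the back-transformed interval always lies inside (0, 1) and is not symmetric about the sample proportion.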

3.27 For two parameters, a confidence interval for $\theta_1 - \theta_2$ based on single-sample estimate
$\hat\theta_i$ and interval $(\ell_i, u_i)$ for $\theta_i$, $i = 1, 2$, is

$$\left(\hat\theta_1 - \hat\theta_2 - \sqrt{(\hat\theta_1 - \ell_1)^2 + (u_2 - \hat\theta_2)^2},\ \ \hat\theta_1 - \hat\theta_2 + \sqrt{(u_1 - \hat\theta_1)^2 + (\hat\theta_2 - \ell_2)^2}\right).$$

Newcombe (1998b) proposed an interval for $\pi_1 - \pi_2$ using the score interval $(\ell_i, u_i)$
for $\pi_i$ that performs similarly to the score method of Section 3.2.5. It is
$(\hat\pi_1 - \hat\pi_2 - z_{\alpha/2} s_L,\ \hat\pi_1 - \hat\pi_2 + z_{\alpha/2} s_U)$, with

$$s_L = \sqrt{\frac{\ell_1(1 - \ell_1)}{n_1} + \frac{u_2(1 - u_2)}{n_2}}, \qquad s_U = \sqrt{\frac{u_1(1 - u_1)}{n_1} + \frac{\ell_2(1 - \ell_2)}{n_2}}.$$

Show that this has the general form above of an interval for $\theta_1 - \theta_2$.

3.28 For multinomial sampling, use the asymptotic variance of $\log\hat\theta$ to show
that for Yule’s Q (Exercise 2.38) the asymptotic variance of $\sqrt{n}(\hat Q - Q)$ is
$\left(\sum_i \sum_j \pi_{ij}^{-1}\right)(1 - Q^2)^2/4$ (Yule 1900, 1912).

3.29 For multinomial probabilities $\pi = (\pi_1, \pi_2, \ldots)$ with a contingency table of arbitrary
dimensions, consider a measure of form $g(\pi) = \nu/\delta$. Show that the
asymptotic variance of $\sqrt{n}[g(p) - g(\pi)]$ is $\sigma^2 = \left[\sum_i \pi_i \eta_i^2 - \left(\sum_i \pi_i \eta_i\right)^2\right]/\delta^4$,
where $\eta_i = \delta(\partial\nu/\partial\pi_i) - \nu(\partial\delta/\partial\pi_i)$ (Goodman and Kruskal 1972).

3.30 Show that $X^2 = n \sum_i \sum_j (p_{ij} - p_{i+}p_{+j})^2/(p_{i+}p_{+j}) = n \sum_i \sum_j p_{i+}p_{+j}(a_{ij} - 1)^2$
for the sample association factors $\{a_{ij}\}$. Thus, $X^2$ can be large when n is large,
regardless of whether the association is practically important. Explain why this test,
like other tests, merely indicates the degree of evidence against $H_0$ and does not
describe strength of association. (“Like fire, the chi-square test is an excellent servant
and a bad master,” Sir Austin Bradford Hill, Proc. R. Soc. Med. 58: 295-300, 1965.)

3.31 For a 2 x 2 table, consider $H_0$: $\pi_{11} = \theta^2$, $\pi_{12} = \pi_{21} = \theta(1 - \theta)$, $\pi_{22} = (1 - \theta)^2$.

a. Show that the marginal distributions are identical and that independence holds.

b. For a multinomial sample, under $H_0$ show that $\hat\theta = (p_{1+} + p_{+1})/2$.

c. Explain how to test $H_0$. Show that df = 2 for the test statistic.

d. Refer to Exercise 3.7. Are Kobe Bryant’s pairs of free throws plausibly independent
and identically distributed?

3.32 For testing independence, show that $X^2 \le n \min(I - 1, J - 1)$. Hence $V^2 =
X^2/[n \min(I - 1, J - 1)]$ falls between 0 and 1 (Cramér 1946). [For 2 x 2 tables,
$X^2/n$ is often called phi-squared; it equals Goodman and Kruskal’s tau of
Exercise 2.39. Other measures based on $X^2$ include the contingency coefficient
$[X^2/(X^2 + n)]^{1/2}$, which Pearson (1904) proposed as an estimate of the correlation
for an underlying bivariate normal distribution.]

3.33 For a 1 x 2 table (i.e., a single binomial Y based on n trials, with probabilities $\pi$ and
$1 - \pi$), consider testing $H_0$: $\pi = \pi_0$.

a. Show that the Pearson residuals are

$$(y - n\pi_0)/\sqrt{n\pi_0} \quad \text{and} \quad -(y - n\pi_0)/\sqrt{n(1 - \pi_0)},$$

which have differing absolute values when $\pi_0 \ne 0.50$.

b. Show that the standardized residuals are

$$(y - n\pi_0)/\sqrt{n\pi_0(1 - \pi_0)} \quad \text{and} \quad -(y - n\pi_0)/\sqrt{n\pi_0(1 - \pi_0)}.$$

Explain why these are more suitable than Pearson residuals.

3.34  For  a  2  x  2  table,  show  that: 

a.  The  four  Pearson  residuals  may  take  different  values. 

b. All four standardized residuals have the same absolute value. (This is sensible,
since df = 1.)

c. The square of each standardized residual equals $X^2$.

3.35 Use a partitioning argument to explain why $G^2$ for testing independence cannot
increase after combining two rows (or two columns) of a contingency table.
[Hint: Explain why $G^2$ for full table = $G^2$ for collapsed table + $G^2$ for table of the
two rows that are combined in the collapsed table.]

3.36 Assume independence, and let $p_{ij} = n_{ij}/n$ and $\hat\pi_{ij} = p_{i+}p_{+j}$.

a. Show that $p_{ij}$ and $\hat\pi_{ij}$ are unbiased for $\pi_{ij} = \pi_{i+}\pi_{+j}$.

b. Show that $\mathrm{var}(p_{ij}) = \pi_{i+}\pi_{+j}(1 - \pi_{i+}\pi_{+j})/n$.

c. Using $E(p_{i+}p_{+j})^2 = E(p_{i+}^2)E(p_{+j}^2)$ and $E(p_{i+}^2) = \mathrm{var}(p_{i+}) + [E(p_{i+})]^2$, show
that

$$\mathrm{var}(\hat\pi_{ij}) = \{\pi_{i+}\pi_{+j}[\pi_{i+}(1 - \pi_{+j}) + \pi_{+j}(1 - \pi_{i+})]\}/n + \pi_{i+}(1 - \pi_{i+})\pi_{+j}(1 - \pi_{+j})/n^2.$$

d. As $n \to \infty$, show that $\lim \mathrm{var}(\sqrt{n}\,\hat\pi_{ij}) \le \lim \mathrm{var}(\sqrt{n}\,p_{ij})$, with equality only if
$\pi_{ij} = 1$ or 0. Hence, if the model holds or if it nearly holds, the model estimator
is better than the sample proportion.

3.37 Consider an I x J table with ordered columns and unordered rows. Ridits (Bross
1958) are data-based column scores. The jth sample ridit is the average cumulative
proportion within category j,

$$\hat r_j = \sum_{k=1}^{j-1} p_{+k} + \left(\tfrac{1}{2}\right) p_{+j}.$$

The sample mean ridit in row i is $\hat R_i = \sum_j \hat r_j\, p_{j|i}$. Show that $\sum_j p_{+j}\hat r_j = 0.50$ and
$\sum_i p_{i+}\hat R_i = 0.50$. [For ridit analyses, see Agresti (2010, Sec. 2.1), Beder and Heim
(1990), Bross (1958), Fleiss et al. (2003, Sec. 9.4), and Landis et al. (1978).]

3.38 Show that the sample value of the uncertainty coefficient (2.13) satisfies $\hat U =
-G^2/\left[2n\left(\sum_j p_{+j} \log p_{+j}\right)\right]$. [Haberman (1982) gave its standard error.]

3.39 Of six candidates for three managerial positions, denote the females by F1, F2, F3
and the males by M1, M2, M3.

a. Identify the 20 possible combinations of candidates that could be selected. Construct
the contingency table for the actual sample, which is (F2, M1, M3).

b. Let $p_1$ denote the sample proportion of males selected and $p_2$ the sample
proportion of females selected. Of the 20 possible samples, show that
10 have $p_1 - p_2 \ge \tfrac{1}{3}$. Thus, if the three managers were randomly selected,
$P(p_1 - p_2 \ge \tfrac{1}{3}) = 10/20 = 0.50$. Explain why this is the P-value for Fisher’s
exact test with $H_a$: $\pi_1 > \pi_2$.
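
The count in part (b) can be checked by brute-force enumeration. The Python sketch below lists all 20 equally likely samples and counts those with a difference of proportions at least as large as the one observed. The helper name `diff_in_proportions` is our own, and exact fractions are used to avoid rounding when comparing with 1/3.

```python
from itertools import combinations
from fractions import Fraction

candidates = ["F1", "F2", "F3", "M1", "M2", "M3"]
samples = list(combinations(candidates, 3))   # all ways to fill the three positions

def diff_in_proportions(sample):
    """p1 - p2: proportion of males selected minus proportion of females selected."""
    males = sum(name.startswith("M") for name in sample)
    females = 3 - males
    return Fraction(males, 3) - Fraction(females, 3)

observed = diff_in_proportions(("F2", "M1", "M3"))
as_extreme = [s for s in samples if diff_in_proportions(s) >= observed]
p_value = Fraction(len(as_extreme), len(samples))
print(len(samples), observed, len(as_extreme), p_value)
```

Enumerating samples with fixed numbers of males, females, and positions is the same as conditioning on both margins of the 2 x 2 table, which is how Fisher’s exact test constructs its null distribution.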

3.40 When a test statistic has a continuous distribution, the P-value has a null uniform
distribution, $P(\text{P-value} \le \alpha) = \alpha$ for $0 < \alpha < 1$. For Fisher’s exact test, explain why
$P(\text{P-value} \le \alpha) \le \alpha$. [Hint: $P(\text{P-value} \le \alpha) = E[P(\text{P-value} \le \alpha \mid n_{1+}, n_{+1}, n)]$.]

3.41 Note 3.3 showed moments of the hypergeometric distribution (3.17). Letting $p =
n_{+1}/n$, show that $n_{11}$ has the same mean as a binomial random variable for $n_{1+}$
trials with success probability p, and that it has its variance multiplied by a finite
population correction factor $(n - n_{1+})/(n - 1)$. (The hypergeometric is similar to
the binomial when $n_{1+}$ is small compared to n.)

3.42 For the tea-tasting data (Table 3.9), construct the null distributions of the ordinary
P-value and the mid P-value for Fisher’s exact test with $H_a$: $\theta > 1$. Find and compare
their expected values.

3.43 In Section 3.5.6 we analyzed a 2 x 2 table having entries (3, 0 / 0, 3). Explain why
the unconditional P-value, evaluated at $\pi = 0.50$, is related to Fisher conditional
P-values for various tables by

$$P(X^2 \ge 6) = \sum_{k=0}^{6} P(X^2 \ge 6 \mid n_{+1} = k)\,P(n_{+1} = k).$$

Thus, the unconditional P-value of $\tfrac{1}{32}$ is a weighted average of the Fisher P-value for
the observed column margins and P-values of 0 corresponding to the impossibility
of getting results as extreme as observed if other margins had occurred (i.e.,
$\tfrac{1}{32} = \tfrac{1}{10}\left[\binom{6}{3}\left(\tfrac{1}{2}\right)^6\right]$).
The Fisher quote in Section 3.5.7 gave his view about this.


3.44 For testing $H_0$: $\pi_1 = \pi_2$ with two binomial variates $y_1$ and $y_2$, a “reasonable” test
would not reject $H_0$ if $y_1 = y_2 = 0$. Since as $\pi_1$ and $\pi_2$ approach 0, the probability of
this converges to 1 even if $\pi_1 \ne \pi_2$, explain why any such test is biased, potentially
having power less than its size (Haber 1986).


3.45 For independent uniform prior distributions for two binomial parameters, show that
$r = \pi_1/\pi_2$ has prior density $g(r) = \tfrac{1}{2}$ for $0 \le r \le 1$ and $g(r) = 1/(2r^2)$ for $r > 1$.

3.46 Explain why a Bayesian HPD interval is sensible for $\pi_1 - \pi_2$ but not usually for
$\pi_1/\pi_2$.


3.47 Consider a particular choice of Dirichlet means $\{\gamma_{ij} = E(\pi_{ij}) = \alpha_{ij}/K\}$ for the Bayes
estimator (1.19) extended to two-way tables. Show that the total mean squared error is

$$[K/(n + K)]^2 \Big[\sum_i \sum_j (\pi_{ij} - \gamma_{ij})^2\Big] + [n/(n + K)]^2 \Big[1 - \sum_i \sum_j \pi_{ij}^2\Big],$$

with the second term divided by n. Show that the value of K that minimizes this is

$$K = \Big(1 - \sum_i \sum_j \pi_{ij}^2\Big) \Big/ \Big[\sum_i \sum_j (\pi_{ij} - \gamma_{ij})^2\Big].$$

Fienberg and Holland (1973) showed this and used the empirical Bayes approach of
estimating K by replacing $\pi$ by the sample proportion p and letting $\{\gamma_{ij} = p_{i+}p_{+j}\}$.
Albert (2010) surveyed Bayesian methods for smoothing contingency tables.


CHAPTER  4 


Introduction  to  Generalized 
Linear  Models 


In  Chapters  2  and  3  we  focused  on  methods  for  two-way  contingency  tables.  Most  studies, 
however, have several explanatory variables, and they may be continuous as well as
categorical. Modeling helps us to efficiently evaluate effects and provide improved estimates of
response  probabilities  because  of  the  parsimonious  reduction  in  the  number  of  parameters. 

The  rest  of  the  book  focuses  on  model  building  for  categorical  response  variables.  In  this 
chapter  we  introduce  a  family  of  generalized  linear  models  that  contains  important  models 
for  categorical  responses  as  well  as  standard  models  for  continuous  responses.  Section  4.1 
defines  three  components  common  to  all  generalized  linear  models.  Section  4.2  illustrates 
with models for binary responses. The most important case is logistic regression, a linear
model for the log odds (logit) transformation of a binomial parameter. In Chapters 5 through
8 we study these models in detail.

In  Section  4.3  we  present  generalized  linear  models  for  counts.  A  Poisson  regression 
model called a loglinear model is a linear model for the log of a Poisson mean. In Chapters
9 and 10 we use them for modeling counts in contingency tables having multiple response
variables.

Sections  4.4  through  4.7  are  more  technical.  Readers  wanting  mainly  an  overview 
of  methods  can  skip  them  or  read  them  lightly.  Section  4.4  shows  likelihood  equations 
and  the  asymptotic  covariance  matrix  of  maximum  likelihood  (ML)  parameter  estimates 
for  generalized  linear  models,  and  Section  4.5  summarizes  inferential  methods.  Methods 
of  solving  the  likelihood  equations  are  presented  in  Section  4.6.  In  the  final  section  we 
introduce  a  generalization,  quasi-likelihood,  that  further  extends  the  scope  of  models. 


4.1  THE  GENERALIZED  LINEAR  MODEL 

Generalized  linear  models  (GLMs)  extend  ordinary  regression  models  to  encompass 
nonnormal  response  distributions  and  modeling  functions  of  the  mean.  They  have  three 
components: A random component identifies the response variable Y and its probability
distribution; a systematic component specifies explanatory variables used in a linear
predictor function; and a link function specifies the function of E(Y) that the model equates
to the linear predictor. Nelder and Wedderburn (1972) introduced the class of GLMs,
although the most important models in the class were established before then.

4.1.1  Components  of  Generalized  Linear  Models 

The random component of a GLM consists of a response variable Y with independent
observations $(y_1, \ldots, y_N)$ from a distribution in the natural exponential family. This family
has probability density function or mass function of form

$$f(y_i; \theta_i) = a(\theta_i)\,b(y_i)\exp[y_i Q(\theta_i)]. \tag{4.1}$$

Several important distributions are special cases, including the Poisson and binomial. The
value of the parameter $\theta_i$ varies for $i = 1, \ldots, N$ as a function of values of explanatory
variables. The parameter $Q(\theta)$ is called the natural parameter. In Section 4.4 we present
a more general formula (4.17) for $f$ that also permits a dispersion parameter, but (4.1) is
sufficient for the discrete data models that are the primary focus of this book.

The systematic component of a GLM relates a vector $(\eta_1, \ldots, \eta_N)$ to the explanatory
variables through a linear model. Let $x_{ij}$ denote the value of explanatory variable $j$ ($j =
0, 1, 2, \ldots$) for subject $i$. Then

$$\eta_i = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, N.$$

This linear combination of explanatory variables is called the linear predictor. Usually,
$x_{i0} = 1$ for all $i$, representing the coefficient of an intercept term $\beta_0$ (often denoted by $\alpha$) in
the model.

The third component of a GLM is a link function that connects the random and systematic
components. Let $\mu_i = E(Y_i)$, $i = 1, \ldots, N$. The model links $\mu_i$ to $\eta_i$ by $\eta_i = g(\mu_i)$, where
the link function $g$ is a monotonic, differentiable function. Thus, $g$ links $\mu_i$ to explanatory
variables through the formula

$$g(\mu_i) = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, N. \tag{4.2}$$

The link function $g(\mu) = \mu$, called the identity link, has $\eta_i = \mu_i$. It specifies a linear
model for the mean itself. This is the link function for ordinary regression with normally
distributed Y. The link function that transforms the mean to the natural parameter is called
the canonical link. For it, $g(\mu_i) = Q(\theta_i)$, and $Q(\theta_i) = \sum_j \beta_j x_{ij}$. Sections 4.1.2 and 4.1.3
show examples.
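
As a small numerical illustration of how the link connects the two other components, the Python sketch below computes a linear predictor $\eta_i$ for one subject and maps it to a mean $\mu_i$ under three links. The coefficient values, the covariate value, and the names `LINKS` and `linear_predictor` are all hypothetical.

```python
import math

# Each link g maps the mean mu to the linear predictor eta; its inverse maps back.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}

def linear_predictor(beta, x):
    """eta_i = sum_j beta_j * x_ij, with x[0] = 1 supplying the intercept."""
    return sum(b * xj for b, xj in zip(beta, x))

beta = [-1.0, 0.5]   # hypothetical coefficients (intercept, slope)
x_i = [1.0, 3.0]     # hypothetical subject: x_i0 = 1, x_i1 = 3
eta_i = linear_predictor(beta, x_i)
for name, (g, g_inv) in LINKS.items():
    mu_i = g_inv(eta_i)
    print(f"{name:8s} eta = {eta_i:.2f}  mu = {mu_i:.4f}  g(mu) = {g(mu_i):.2f}")
```

The same value of $\eta_i$ corresponds to different means under different links, and the log and logit inverses automatically keep the mean positive or between 0 and 1.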

In  summary,  a  GLM  is  a  linear  model  for  a  transformed  mean  of  a  response  variable  that 
has  distribution  in  the  natural  exponential  family.  We  now  illustrate  the  three  components 
by  introducing  the  key  GLMs  for  discrete  response  variables. 

4.1.2  Binomial  Logit  Models  for  Binary  Data 

Many response variables are binary. We represent the “success” and “failure” outcomes
by 1 and 0. A Bernoulli trial has probabilities $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$, for
which $E(Y) = \pi$. This is the special case of the binomial distribution (1.1) with n = 1. We
can express the probability mass function as

$$f(y; \pi) = \pi^y(1 - \pi)^{1-y} = (1 - \pi)[\pi/(1 - \pi)]^y = (1 - \pi)\exp\left(y \log\frac{\pi}{1 - \pi}\right) \tag{4.3}$$

for y = 0 and 1. This is in the natural exponential family (4.1), identifying $\theta$ with $\pi$, $a(\pi) =
1 - \pi$, $b(y) = 1$, and $Q(\pi) = \log[\pi/(1 - \pi)]$. The natural parameter $\log[\pi/(1 - \pi)]$ is the
log odds of response outcome 1, the logit of $\pi$. This is the canonical link function. GLMs
using the logit link are introduced further in Section 4.2.3. They are referred to as logistic
regression models, or sometimes simply as logit models.
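
The factorization in (4.3) can be checked numerically. The Python sketch below, with function names of our own, evaluates the Bernoulli probability mass function both directly and in the exponential-family form $a(\pi)\,b(y)\exp[yQ(\pi)]$ and confirms that the two agree.

```python
import math

def bernoulli_pmf(y, pi):
    """Direct form: pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi ** y * (1 - pi) ** (1 - y)

def exp_family_form(y, pi):
    """a(pi) * b(y) * exp[y * Q(pi)] with a = 1 - pi, b = 1, Q = logit(pi)."""
    a = 1 - pi
    b = 1
    Q = math.log(pi / (1 - pi))     # natural parameter: the log odds
    return a * b * math.exp(y * Q)

for pi in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert math.isclose(bernoulli_pmf(y, pi), exp_family_form(y, pi))
print("Bernoulli pmf matches the exponential-family form (4.1)")
```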


4.1.3  Poisson  Loglinear  Models  for  Count  Data 

Some  response  variables  have  counts  as  their  possible  outcomes.  In  a  health  survey,  each 
observation  might  be  the  number  of  illnesses  in  the  past  year  for  which  the  subject  visited 
a  doctor.  Counts  also  occur  as  entries  in  contingency  tables. 

The simplest distribution for count data is the Poisson. The Poisson probability mass
function (1.4) for a count Y is

$$f(y; \mu) = \frac{e^{-\mu}\mu^y}{y!} = \exp(-\mu)\left(\frac{1}{y!}\right)\exp[y(\log\mu)], \quad y = 0, 1, 2, \ldots.$$

This has natural exponential form (4.1) with $\theta = \mu$, $a(\mu) = \exp(-\mu)$, $b(y) = 1/y!$, and
$Q(\mu) = \log\mu$. The natural parameter is $\log\mu$, so the canonical link function is the log
link, $\eta = \log\mu$. The model using this link function is

$$\log\mu_i = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, N. \tag{4.4}$$

This model, to be introduced further in Section 4.3.1, is called a Poisson loglinear model.
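
A short Python sketch can confirm the Poisson factorization and show what the log link implies for the mean. The coefficients $\beta_0 = 0.2$ and $\beta_1 = 0.3$ and the function names are hypothetical. With a single predictor, model (4.4) makes the mean change by the multiplicative factor $\exp(\beta_1)$ for each one-unit increase in $x$.

```python
import math

def poisson_pmf(y, mu):
    """Direct form: exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def exp_family_form(y, mu):
    """a(mu) * b(y) * exp[y * Q(mu)] with a = exp(-mu), b = 1/y!, Q = log(mu)."""
    return math.exp(-mu) * (1 / math.factorial(y)) * math.exp(y * math.log(mu))

assert all(math.isclose(poisson_pmf(y, 2.5), exp_family_form(y, 2.5)) for y in range(10))

# Loglinear model log(mu) = beta0 + beta1 * x with hypothetical coefficients.
beta0, beta1 = 0.2, 0.3
mu = [math.exp(beta0 + beta1 * x) for x in range(4)]
ratios = [mu[k + 1] / mu[k] for k in range(3)]
print([round(m, 3) for m in mu], [round(r, 3) for r in ratios])
```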


4.1.4  Generalized  Linear  Models  for  Continuous  Responses 

The  class  of  GLMs  also  includes  models  for  continuous  responses.  The  normal  distribution 
is  in  a  natural  exponential  family  that  includes  dispersion  parameters.  Its  natural  parameter 
is  the  mean.  Therefore,  an  ordinary  regression  model  is  a  GLM  using  the  identity  link. 
Table 4.1 lists this and other standard models for a normal random component. The table
also  lists  GLMs  for  discrete  responses  that  are  presented  in  Chapters  5-10. 


4.1.5  Deviance  of  a  GLM 

For a particular GLM with observations $\boldsymbol{y} = (y_1, \ldots, y_N)$, let $L(\boldsymbol{\mu}; \boldsymbol{y})$ denote the
log-likelihood function expressed in terms of the means $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_N)$. Let $L(\hat{\boldsymbol{\mu}}; \boldsymbol{y})$ denote
the maximum of the log likelihood for the model. Considered for all possible models, the
maximum achievable log likelihood is $L(\boldsymbol{y}; \boldsymbol{y})$. This occurs for the most general model,
having a separate parameter for each observation and the perfect fit $\hat{\boldsymbol{\mu}} = \boldsymbol{y}$. Such a model
is called the saturated model. This model is not useful, because it does not provide data
reduction. However, it serves as a baseline for comparison with other model fits.

Table 4.1 Types of Generalized Linear Models for Statistical Analysis

Random                             Systematic
Component     Link Function        Component     Model                     Chapters
Normal        Identity             Continuous    Regression
Normal        Identity             Categorical   Analysis of variance
Normal        Identity             Mixed         Analysis of covariance
Binomial      Logit                Mixed         Logistic regression       5 and 6
Binomial      Probit and others    Mixed         Binary regression         7
Multinomial   Generalized logit    Mixed         Multinomial response      8
Poisson       Log                  Mixed         Loglinear                 9 and 10

The  deviance  of  a  Poisson  or  binomial  GLM  is  defined  to  be 

$$-2[L(\hat{\boldsymbol{\mu}}; \boldsymbol{y}) - L(\boldsymbol{y}; \boldsymbol{y})].$$

This  is  the  likelihood-ratio  statistic  for  testing  the  null  hypothesis  that  the  model  holds 
against  the  general  alternative  (i.e.,  the  saturated  model).  We  use  the  deviance  throughout 
the  book  for  model  checking  and  for  inferential  comparisons  of  models.  Methods  for 
analyzing  the  deviance  generalize  analysis  of  variance  methods  for  normal  linear  models. 

For some applications with Poisson and binomial GLMs, the number of observations N
is fixed and the individual counts are relatively large. Then the deviance has an approximate
chi-squared null distribution. The df = N - p, where p is the number of model parameters;
that is, df equals the difference between the numbers of parameters in the saturated model
and in the unsaturated model. The deviance then provides a test of model fit.

One such example is independent binomial counts at N fixed settings of predictors when
the number of trials at each setting is large. Let $Y_i$ be bin($n_i$, $\pi_i$), $i = 1, \ldots, N$. Consider
the simple model of homogeneity, $\pi_i = \alpha$ for all $i$. It has p = 1 parameter. The saturated model
makes no assumption about $\{\pi_i\}$, letting them be any N values between 0 and 1.0. It has
N parameters. The deviance for the homogeneity model has df = N - 1. In fact, it equals
the $G^2$ likelihood-ratio statistic (3.11) for testing independence in the N x 2 contingency
table that these samples form. Under independence, its distribution converges to a
chi-squared distribution as the $\{n_i\}$ increase, for fixed N. Another example is a contingency
table constructed from sample survey data, in which the classification categories and the
number of cells N are fixed as we collect more data, and we treat the cell counts as Poisson
variates.
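
The equality between the deviance of the homogeneity model and $G^2$ can be verified numerically. The Python sketch below uses hypothetical binomial counts at N = 4 settings and helper names of our own. It computes the deviance from the definition $-2[L(\hat{\boldsymbol{\mu}}; \boldsymbol{y}) - L(\boldsymbol{y}; \boldsymbol{y})]$ and, separately, $G^2 = 2\sum \text{observed} \times \log(\text{observed}/\text{expected})$ for the N x 2 table of successes and failures.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def binom_loglik(y, n, pi):
    """Binomial log likelihood, omitting binomial coefficients (they cancel in the deviance)."""
    return sum(xlogy(yi, p) + xlogy(ni - yi, 1 - p) for yi, ni, p in zip(y, n, pi))

# Hypothetical data: successes y_i out of n_i trials at N = 4 settings.
y = [12, 18, 25, 30]
n = [40, 40, 50, 45]

pi_hat = sum(y) / sum(n)                         # ML fit of the homogeneity model
fitted = [pi_hat] * len(y)
saturated = [yi / ni for yi, ni in zip(y, n)]    # perfect fit: one parameter per setting

deviance = -2 * (binom_loglik(y, n, fitted) - binom_loglik(y, n, saturated))
df = len(y) - 1

# G^2 for independence in the N x 2 table of (success, failure) counts.
g2 = 0.0
for yi, ni in zip(y, n):
    for obs, expected in ((yi, ni * pi_hat), (ni - yi, ni * (1 - pi_hat))):
        g2 += 2 * xlogy(obs, obs / expected)

print(f"deviance = {deviance:.4f}, G^2 = {g2:.4f}, df = {df}")
```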


4.1.6  Advantages  of  GLMs  Versus  Transforming  the  Data 

A traditional way to model data transforms Y so that it has approximately a normal
distribution with constant variance; then, ordinary least-squares regression is applicable. With
GLMs, by contrast, the choice of link function is separate from the choice of random
component. If a link is useful in the sense that a linear model for the predictors is plausible
for that link, it is not necessary that it also stabilizes variance or produces normality. This
is because the fitting process maximizes the likelihood for the choice of distribution for Y,
and that choice is not restricted to normality.

Let g denote a function, such as the log function, that is a link function in the GLM
approach or a transformation function in the transformed data approach. An advantage of
the GLM formulation is that the model parameters describe $g[E(Y)]$, rather than $E[g(Y)]$ as
in the transformed data approach. With the GLM approach, those parameters also describe
effects of explanatory variables on $E(Y)$, after applying the inverse function for g. Such
effects are more relevant than effects of explanatory variables on $E[g(Y)]$.
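
The distinction between $g[E(Y)]$ and $E[g(Y)]$ can be seen numerically. The following Python sketch uses a hypothetical two-point distribution (Y = 1 or Y = 9, each with probability 1/2, chosen only for illustration) with g the log function; the log of the mean and the mean of the log differ, and back-transforming the latter gives the geometric mean rather than $E(Y)$.

```python
import math

# Hypothetical two-point distribution: Y = 1 or Y = 9, each with probability 1/2.
values = [1.0, 9.0]
probs = [0.5, 0.5]

mean_y = sum(p * y for p, y in zip(probs, values))                  # E(Y)
log_of_mean = math.log(mean_y)                                      # g[E(Y)]
mean_of_log = sum(p * math.log(y) for p, y in zip(probs, values))   # E[g(Y)]

print(f"E(Y)          = {mean_y:.3f}")
print(f"log E(Y)      = {log_of_mean:.3f}")
print(f"E[log Y]      = {mean_of_log:.3f}")
print(f"exp(E[log Y]) = {math.exp(mean_of_log):.3f}  (geometric mean, not E(Y))")
```

A model for $E[\log Y]$ therefore describes effects on the geometric-mean scale after back-transformation, whereas a GLM with log link describes effects on $E(Y)$ itself.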

GLMs  provide  a  unified  theory  of  modeling  that  encompasses  the  most  important  models 
for  continuous  and  discrete  variables.  Models  studied  in  this  text  are  GLMs  with  binomial 
or  Poisson  random  component,  or  multivariate  extensions  of  GLMs.  The  ML  parameter 
estimates  are  computed  with  an  algorithm,  presented  in  Section  4.6,  that  iteratively  uses  a 
weighted  version  of  least  squares.  A  reason  for  restricting  GLMs  to  the  exponential  family 
of  distributions  for  Y  is  that  the  same  algorithm  applies  to  this  entire  family,  for  any  choice 
of  link  function. 

Nearly all statistical software has the facility to fit GLMs. This text’s computing appendix
at www.stat.ufl.edu/~aa/cda/cda.html gives details.


4.2  GENERALIZED  LINEAR  MODELS  FOR  BINARY  DATA 

Let Y denote a binary response variable, such as the result of a medical treatment (success,
failure). Each observation has one of two outcomes, denoted by 1 and 0, which we treat
as a binomial variate for a single Bernoulli trial. The mean is $E(Y) = P(Y = 1)$. We denote
$P(Y = 1)$ by $\pi(x)$, reflecting its dependence on values $x = (x_1, \ldots, x_p)$ of explanatory
variables. The variance of Y is

$$\mathrm{var}(Y) = \pi(x)[1 - \pi(x)],$$

which is the binomial variance for n = 1.

4.2.1  Linear  Probability  Model 

For  a  binary  response  variable,  the  regression  model 


$$\pi(x) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p \tag{4.5}$$

is  called  a  linear  probability  model.  With  independent  observations  it  is  a  GLM  with 
binomial  random  component  and  identity  link  function. 

This model has a major structural defect: Probabilities fall between 0 and 1, but linear
functions take values over the entire real line. Model (4.5) can have $\pi(x) < 0$ and/or
$\pi(x) > 1$ for some x values. The model can be valid over a restricted range of x values.
When it is plausible, an advantage is its simple interpretation: $\beta_j$ is the change in $\pi(x)$ for
a one-unit increase in $x_j$.

We defer to Section 4.6 the technical details of ML model fitting for this and other
GLMs. Since $\mathrm{var}(Y) = \pi(x)[1 - \pi(x)]$, the variance depends on x through its influence on
$\pi(x)$. The constant variance condition that makes ordinary least-squares estimators optimal
(i.e., minimum variance in the class of linear unbiased estimators) is not satisfied, so the
ML estimator is more efficient than least squares. The estimates and standard errors for
ML and least squares are usually similar, however, when $\hat{\pi}(x)$ for the sample x values
falls in the range within which the variance is relatively stable, about 0.3 to 0.7. When
the model is used with multiple explanatory variables, difficulties often occur with ML model
fitting because at a step of the iterative fitting process, $\hat{\pi}(x)$ falls outside the [0, 1] range for
some subjects’ x values. Least-squares fitting still works in such cases, but it also typically
gives such unsatisfactory $\hat{\pi}(x)$ estimates. Also Y, being binary, is very far from normally
distributed, so the usual t sampling distributions for standardized least-squares estimators
do not apply.


4.2.2  Example:  Snoring  and  Heart  Disease 

We  illustrate  the  linear  probability  model  with  Table  4.2,  from  an  epidemiological  survey  to 
investigate  snoring  as  a  risk  factor  for  heart  disease.  The  sample  consists  of  2484  subjects 
who  visited  four  family  practice  units  in  Toronto  that  served  different  socioeconomic  classes 
and  ethnic  groups.  Those  surveyed  were  classified  according  to  their  spouses’  report  of  how 
much  they  snored  and  according  to  whether  they  reported  having  heart  disease.  The  model 
states  that  the  probability  of  heart  disease  is  linearly  related  to  the  level  of  snoring  x.  We 
treat  the  rows  of  the  table  as  independent  binomial  samples.  No  obvious  choice  of  scores 
exists  for  categories  of  x.  We  used  (0,  2,  4,  5),  treating  the  last  two  levels  as  closer  than 
the  other  adjacent  pairs.  ML  estimates  and  standard  errors  are  the  same  if  we  use  a  data 
file  of  2484  binary  observations  or  if  we  enter  the  four  binomial  totals  of  “yes”  and  “no” 
responses  listed  in  Table  4.2. 

Software reports the ML fit, $\hat{\pi}(x) = 0.0172 + 0.0198x$, with $\hat{\beta} = 0.0198$ having SE =
0.0028. For nonsnorers (x = 0), the estimated proportion of subjects having heart disease
is 0.0172. We refer to the estimated values of $E(Y)$ for a GLM as fitted values. Table 4.2
shows  the  sample  proportions  and  the  fitted  values  for  this  model.  Figure  4.1  graphs 
the  sample  and  fitted  values.  The  table  and  graph  suggest  that  the  model  fits  well.  (In 
Section  5.2.3  we  present  formal  goodness-of-fit  analyses  for  binary-response  GLMs.)  The 
model  interpretation  is  simple.  The  estimated  probability  of  heart  disease  is  about  0.02  for 
nonsnorers;  it  increases  2(0.0198)  =  0.04  for  occasional  snorers,  another  0.04  for  those 
who  snore  nearly  every  night,  and  another  0.02  for  those  who  always  snore.  The  study  was 
observational,  and  it  is  unclear  whether  this  association  could  be  due  to  some  confounding 
factor  or  a  medical  condition  such  as  sleep  apnea. 
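
The reported fit can be reproduced from the four binomial rows of Table 4.2. The Python sketch below is one way to do this, assuming Fisher scoring (iteratively reweighted least squares) with the sample proportions as starting values; the function name `fit_linear_probability` is hypothetical, and this is not necessarily the algorithm used by any particular software package.

```python
import math

# Table 4.2: snoring scores, heart disease "yes" counts, and row totals.
scores = [0.0, 2.0, 4.0, 5.0]
yes = [24, 35, 21, 30]
totals = [1379, 638, 213, 254]

def fit_linear_probability(x, y, n, iterations=25):
    """Fisher scoring for a binomial GLM with identity link, pi = alpha + beta*x."""
    p = [yi / ni for yi, ni in zip(y, n)]    # sample proportions
    fitted = p[:]                             # starting values for pi-hat
    for _ in range(iterations):
        # Weight = inverse of the binomial variance of each sample proportion.
        w = [ni / (f * (1.0 - f)) for ni, f in zip(n, fitted)]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        pbar = sum(wi * pi for wi, pi in zip(w, p)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xbar) * (pi - pbar) for wi, xi, pi in zip(w, x, p))
        beta = sxy / sxx                      # weighted least-squares slope
        alpha = pbar - beta * xbar            # weighted least-squares intercept
        fitted = [alpha + beta * xi for xi in x]
    se_beta = math.sqrt(1.0 / sxx)            # from the inverse information matrix
    return alpha, beta, se_beta, fitted

alpha_hat, beta_hat, se_beta, fitted = fit_linear_probability(scores, yes, totals)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}, SE(beta) = {se_beta:.4f}")
print("fitted:", [round(f, 3) for f in fitted])
```

With the identity link the working response is simply the sample proportion, so each iteration is a weighted least-squares regression of the proportions on the scores.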


Table 4.2  Relationship Between Snoring and Heart Disease

                        Heart Disease
                        -------------   Proportion   Linear   Logistic
Snoring                  Yes      No       Yes        Fit^a     Fit^a
Never                     24    1355      0.017       0.017     0.021
Occasionally              35     603      0.055       0.057     0.044
Nearly every night        21     192      0.099       0.096     0.093
Every night               30     224      0.118       0.116     0.132

^a Model fits refer to proportion of yes responses.

Source: P. G. Norton and E. V. Dunn, Br. Med. J. 291: 630-632, 1985. BMJ Publishing Group.


4.2.3  Logistic Regression Model

Usually, binary data result from a nonlinear relationship between $\pi(x)$ and x. A fixed
change in x often has less impact when $\pi(x)$ is near 0 or 1 than when $\pi(x)$ is near 0.50. In
the purchase of an automobile, consider the choice between buying new or used. Let $\pi(x)$
denote the probability of selecting new when annual family income = x. An increase of
\$10,000 in annual income would have less effect when x = \$1,000,000 [for which $\pi(x)$ is
near 1] than when x = \$50,000.

In practice, nonlinear relationships between $\pi(x)$ and x are often monotonic, with $\pi(x)$
increasing continuously or $\pi(x)$ decreasing continuously as x increases. The S-shaped
curves in Figure 4.2 are typical. The most important curve with this shape has the model
formula

$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}. \tag{4.6}$$

This is a logistic regression model. As x increases, $\pi(x)$ increases when $\beta > 0$ and decreases
when $\beta < 0$.

Let’s  find  the  link  function  for  which  logistic  regression  is  a  GLM.  For  (4.6)  extended 
to  multiple  predictors,  the  odds  are 


$$\frac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p).$$

The log odds has the linear relationship

$$\log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{4.7}$$

Figure 4.2  Logistic regression functions. [Two panels: $\beta > 0$ and $\beta < 0$.]


Thus,  the  appropriate  link  is  the  log  odds  transformation,  the  logit.  Logistic  regression 
models  are  GLMs  with  binomial  random  component  and  logit  link  function. 

The logit is the natural parameter for the binomial distribution, so the logit link is its
canonical link function. Whereas $\pi(x)$ must fall in the (0, 1) range, the logit can be any real
number. The real numbers are also the range for linear predictors that form the systematic
component of a GLM. So, this model does not have the structural problem that the linear
probability model has.

For the snoring data in Table 4.2, software reports the logistic regression ML fit

$$\mathrm{logit}[\hat{\pi}(x)] = -3.87 + 0.40x.$$

The positive $\hat{\beta} = 0.40$ reflects the increased incidence of heart disease at higher snoring
levels. In Chapters 5 and 6 we study logistic regression in detail and interpret such equations.
Estimated probabilities result from substituting x values into the estimate of probability
formula (4.6). Table 4.2 also reports these fitted values. Figure 4.1 displays the fit. The fit
is close to linear over this narrow range of estimated probabilities, and results are similar
to those for the linear probability model.
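
As a check on the reported logistic fit, the following Python sketch applies Newton-Raphson to the grouped binomial log-likelihood for the snoring data. The function name `fit_logistic` and the choice of starting values (the null model) are assumptions made for illustration, not a description of what any particular software does.

```python
import math

# Table 4.2: snoring scores, heart disease "yes" counts, and row totals.
scores = [0.0, 2.0, 4.0, 5.0]
yes = [24, 35, 21, 30]
totals = [1379, 638, 213, 254]

def fit_logistic(x, y, n, iterations=25):
    """Newton-Raphson for logit[pi(x)] = alpha + beta*x with grouped binomial data."""
    overall = sum(y) / sum(n)
    alpha, beta = math.log(overall / (1.0 - overall)), 0.0    # null-model start
    for _ in range(iterations):
        pi = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(yi - ni * p for yi, ni, p in zip(y, n, pi))
        u1 = sum(xi * (yi - ni * p) for xi, yi, ni, p in zip(x, y, n, pi))
        # Information matrix entries, with weights n*pi*(1 - pi).
        w = [ni * p * (1.0 - p) for ni, p in zip(n, pi)]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: (inverse information) times (score).
        alpha += (i11 * u0 - i01 * u1) / det
        beta += (i00 * u1 - i01 * u0) / det
    return alpha, beta

alpha_hat, beta_hat = fit_logistic(scores, yes, totals)
fitted = [1.0 / (1.0 + math.exp(-(alpha_hat + beta_hat * s))) for s in scores]
print(f"logit[pi(x)] = {alpha_hat:.2f} + {beta_hat:.2f} x")
print("fitted:", [round(f, 3) for f in fitted])
```

The fitted probabilities come from substituting the estimates into formula (4.6) and can be compared with the logistic fit column of Table 4.2.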

4.2.4  Binomial  GLM  for  2  x  2  Contingency  Tables 

Among the simplest GLMs for a binary response is the one having a single explanatory
variable x that is also binary. Label its values by 0 and 1. For a given link function, the
GLM

$$\mathrm{link}[\pi(x)] = \alpha + \beta x$$

has the effect of x described by

$$\beta = \mathrm{link}[\pi(1)] - \mathrm{link}[\pi(0)].$$

For the identity link, $\beta = \pi(1) - \pi(0)$ is the difference between proportions. For the
log link, $\beta = \log[\pi(1)] - \log[\pi(0)] = \log[\pi(1)/\pi(0)]$ is the log relative risk. For the logit
link,

$$\beta = \mathrm{logit}[\pi(1)] - \mathrm{logit}[\pi(0)] = \log\frac{\pi(1)}{1 - \pi(1)} - \log\frac{\pi(0)}{1 - \pi(0)} = \log\left[\frac{\pi(1)/(1 - \pi(1))}{\pi(0)/(1 - \pi(0))}\right]$$

is the log odds ratio. Measures of association for 2 × 2 tables are effect parameters in GLMs
for binary data.
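
To make the three cases concrete, the Python sketch below computes $\beta$ under each link for a pair of hypothetical probabilities, $\pi(0) = 0.2$ and $\pi(1) = 0.4$ (values chosen only for illustration; the function name `effect_parameters` is likewise hypothetical).

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def effect_parameters(pi0, pi1):
    """Return beta = link[pi(1)] - link[pi(0)] for the identity, log, and logit links."""
    return {
        "identity": pi1 - pi0,                    # difference of proportions
        "log": math.log(pi1) - math.log(pi0),     # log relative risk
        "logit": logit(pi1) - logit(pi0),         # log odds ratio
    }

effects = effect_parameters(pi0=0.2, pi1=0.4)
for link, beta in effects.items():
    print(f"{link:8s} link: beta = {beta:.4f}")
print(f"relative risk = {math.exp(effects['log']):.3f}, "
      f"odds ratio = {math.exp(effects['logit']):.3f}")
```

Exponentiating the log-link and logit-link effects recovers the relative risk and the odds ratio.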


4.2.5  Probit  and  Inverse  cdf  Link  Functions 

A monotone regression curve such as the first one in Figure 4.2 has the shape of a cumulative
distribution function (cdf) for a continuous random variable. This suggests a model for a
binary response having form $\pi(x) = F(x)$ for some cdf F.

Using a class of location-scale cdf's, such as normal cdf’s with their variety of means
and variances, permits the curve $\pi(x) = F(x)$ to have flexibility in the rate of increase and
in the location where most of that increase occurs. Let $\Phi(\cdot)$ denote the standard cdf of the
class, such as the N(0, 1) cdf. Using $\Phi$ but writing the model as

$$\pi(x) = \Phi(\alpha + \beta x) \tag{4.8}$$

provides the same flexibility. The values of $\alpha$ and $\beta$ determine the particular cdf in the
class. Replacing x by $\beta x$ permits the curve to increase at a different rate than the standard
cdf (or even to decrease if $\beta < 0$); varying $\alpha$ moves the curve to the left or right.

When $\Phi$ is strictly increasing over the entire real line, its inverse function $\Phi^{-1}$ exists
and (4.8) is, equivalently,

$$\Phi^{-1}[\pi(x)] = \alpha + \beta x. \tag{4.9}$$

For this class of cdf shapes, the link function for the GLM is $\Phi^{-1}$. The link function maps
the (0, 1) range of probabilities onto $(-\infty, \infty)$, the range of linear predictors. The curve
has the shape of a normal cdf when $\Phi$ is the standard normal cdf. Model (4.9) is then called
the probit model. This curve has similar appearance to the logistic regression curve. Probit
models are discussed in Section 7.1.

When $\beta > 0$, the logistic regression curve (4.6) is a cdf for the logistic distribution. When
$\beta < 0$, the curve for $1 - \pi(x)$ has that appearance. The cdf of the logistic distribution with
mean $\mu$ and dispersion parameter $\tau > 0$ is

$$F(x) = \frac{\exp[(x - \mu)/\tau]}{1 + \exp[(x - \mu)/\tau]}, \quad -\infty < x < \infty.$$

The corresponding probability density function (pdf) is symmetric and bell-shaped, with
standard deviation $\tau\pi/\sqrt{3}$, for the mathematical constant $\pi = 3.14\ldots$. It looks much like
the normal density with the same mean and standard deviation but with slightly thicker
tails.¹ The standard form of the logistic cdf has $\mu = 0$ and $\tau = 1$, so $\Phi(x) = e^x/(1 + e^x)$.
For that function, the logistic regression curve (4.6) has form $\pi(x) = \Phi(\alpha + \beta x)$. By (4.9)
the logit transformation is simply the inverse function for the standard logistic cdf; that is,
when $\Phi(x) = \pi(x) = e^x/(1 + e^x)$, then $x = \Phi^{-1}[\pi(x)] = \log[\pi(x)/(1 - \pi(x))]$.
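
The inverse-cdf view of the logit and probit links can be checked numerically. The Python sketch below evaluates the standard logistic cdf, confirms that the logit undoes it, and compares the logistic cdf with a normal cdf having the same mean and standard deviation ($\pi/\sqrt{3}$); the helper names are hypothetical, and `statistics.NormalDist` from the Python standard library supplies the normal cdf and its inverse.

```python
import math
from statistics import NormalDist

def logistic_cdf(z):
    """Standard logistic cdf, Phi(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the standard logistic cdf."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Inverse of the standard normal cdf."""
    return NormalDist().inv_cdf(p)

# Normal distribution matched to the standard logistic: mean 0, sd pi/sqrt(3).
matched_normal = NormalDist(mu=0.0, sigma=math.pi / math.sqrt(3.0))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    p = logistic_cdf(z)
    print(f"z = {z:5.1f}  logistic cdf = {p:.4f}  "
          f"matched normal cdf = {matched_normal.cdf(z):.4f}  "
          f"logit(p) = {logit(p):6.3f}  probit(p) = {probit(p):6.3f}")
```

The logit column reproduces z, as the inverse relationship requires, while the two cdf columns stay close to each other across the range.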

4.2.6  Latent  Tolerance  Motivation  for  Binary  Response  Models 

We  now  present  another  motivation  for  the  link  function  having  the  form  of  the  inverse  of 
a  cdf.  It  results  from  early  applications  of  binary  response  models  to  toxicology  studies, 
such  as  in  Bliss  (1935),  with  an  unobserved  tolerance  distribution. 

In toxicology, binary response models describe the effect of dosage of a toxin on whether
a subject dies. Let x denote the dosage level. For a randomly selected subject, let Y = 1
if the subject dies. Suppose that the subject has a tolerance threshold T for the dosage,
with (Y = 1) equivalent to ($T \le x$). For instance, an insect survives if the dosage x is
less than T and dies if the dosage is at least T. Tolerances vary among subjects; let
$F(t) = P(T \le t)$. For fixed dosage x, the probability a randomly selected subject dies is

$$\pi(x) = P(Y = 1 \mid X = x) = P(T \le x) = F(x).$$

That is, the appropriate binary model is the one having the shape of the cdf F of the tolerance
distribution.

An unobserved variable such as T is referred to as a latent variable. In practice we do not
know the particular F that generates T, and we assume that F belongs to some parametric
family. Let $\Phi$ denote the standard cdf for that family. A common standardization uses the
mean $\mu$ and standard deviation $\sigma$ of T, so that

$$\pi(x) = F(x) = \Phi[(x - \mu)/\sigma].$$

Then, the model has form $\pi(x) = \Phi(\alpha + \beta x)$, for $\alpha = -\mu/\sigma$ and $\beta = 1/\sigma$. In GLM form,

$$\Phi^{-1}[\pi(x)] = \alpha + \beta x. \tag{4.10}$$

Whereas the cdf maps the real line onto the (0, 1) probability scale, the inverse cdf maps
the (0, 1) scale for $\pi(x)$ onto the real line values for linear predictors in binary response
models.
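
A small simulation illustrates the tolerance argument. In the Python sketch below, tolerances are drawn from a normal distribution with hypothetical mean 10 and standard deviation 2 (values chosen only for illustration), a subject dies when its tolerance is at most the dose, and the simulated death proportion is compared with $\Phi(\alpha + \beta x)$ for $\alpha = -\mu/\sigma$ and $\beta = 1/\sigma$.

```python
import random
from statistics import NormalDist

def simulated_death_rate(dose, mu, sigma, n_subjects=20000, seed=1):
    """Proportion of simulated subjects whose tolerance T is at most the dose."""
    rng = random.Random(seed)
    deaths = sum(1 for _ in range(n_subjects) if rng.gauss(mu, sigma) <= dose)
    return deaths / n_subjects

mu, sigma = 10.0, 2.0                      # hypothetical tolerance mean and sd
alpha, beta = -mu / sigma, 1.0 / sigma     # implied probit model parameters

results = []
for dose in (8.0, 10.0, 13.0):
    model = NormalDist().cdf(alpha + beta * dose)     # Phi(alpha + beta * x)
    simulated = simulated_death_rate(dose, mu, sigma)
    results.append((dose, model, simulated))
    print(f"dose = {dose:4.1f}  model pi(x) = {model:.3f}  simulated = {simulated:.3f}")
```

Because the tolerance distribution here is normal, the implied binary model is the probit model; a logistic tolerance distribution would give logistic regression in the same way.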


4.3  GENERALIZED  LINEAR  MODELS  FOR  COUNTS  AND  RATES 

The best known GLMs for count data assume a Poisson distribution for Y. We'll use
Poisson GLMs for counts in contingency tables with categorical response variables. We
first introduce Poisson GLMs to model count or rate data for a single nonnegative integer-valued
response variable.

¹ Its kurtosis equals that of a t distribution with df = 9. Albert and Chib (1993) noted that a t variate with
df = 8 divided by 0.634 well approximates a standard logistic variate. Caffo and Griswold (2006) used a similar
approximation with df = 8.78.


4.3.1  Poisson  Loglinear  Models 

The Poisson distribution (1.4) has a positive mean $\mu$. Although a GLM can model a positive
mean using the identity link, it is more common to model the log of the mean. Like the
linear predictor, the log mean can take any real value. The log mean is the natural parameter
for the Poisson distribution, and the log link is the canonical link for a Poisson GLM. A
Poisson loglinear GLM assumes a Poisson distribution for Y and uses the log link.

The Poisson loglinear model with explanatory variables x is

$$\log \mu(x) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{4.11}$$

For this model, the mean satisfies the exponential relationship

$$\mu(x) = \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p) = e^{\alpha}(e^{\beta_1})^{x_1} \cdots (e^{\beta_p})^{x_p}. \tag{4.12}$$

A 1-unit increase in $x_j$ has a multiplicative impact of $e^{\beta_j}$: The mean at $x_j + 1$ equals the
mean at $x_j$ multiplied by $e^{\beta_j}$.


4.3.2  Example:  Horseshoe  Crab  Mating 

We illustrate Poisson GLMs using a study of female horseshoe crabs² on an island in the
Gulf of Mexico. During spawning season, a female migrates to the shore to breed, with
a male attached to her posterior spine, and she burrows into the sand and lays clusters of
eggs. During spawning, other male crabs may group around the pair and may also fertilize
the eggs. These male crabs that cluster around the female crab are called satellites.

In this example, the response outcome for each of 173 female crabs is her number of
satellites. Explanatory variables are the female crab's color, spine condition, weight, and
carapace width. Table 4.3 shows a small set of the data. The complete data are available at
the text website www.stat.ufl.edu/~aa/cda/cda.html. For now, we use width alone
as a predictor. Table 4.3 lists width in centimeters. The sample mean width equals 26.3 and
the standard deviation equals 2.1.


Table 4.3  Number of Male Satellites by Female Crab’s Characteristics^a

 C   S    W      Wt   Sa  |  C   S    W      Wt   Sa  |  C   S    W      Wt   Sa
 2   3   28.3   3.05   8  |  3   3   22.5   1.55   0  |  1   1   26.0   2.30   9
 3   3   26.0   2.60   4  |  2   3   23.8   2.10   0  |  3   2   24.7   1.90   0
 3   3   25.6   2.15   0  |  3   3   24.3   2.15   0  |  2   3   25.8   2.65   0
 4   2   21.0   1.85   0  |  2   1   26.0   2.30  14  |  1   1   27.1   2.95   8

^a C, color (1, light medium; 2, medium; 3, dark medium; 4, dark); S, spine condition (1, both good; 2, one worn
or broken; 3, both worn or broken); W, carapace width (cm); Wt, weight (kg); Sa, number of satellites.

Source: Data courtesy of Jane Brockmann, Zoology Department, University of Florida; study described in
Ethology 102: 1-21, 1996. The complete data are at the text website.

² See en.wikipedia.org/wiki/Horseshoe_crab and horseshoecrab.org for details
about horseshoe crabs, including pictures of their mating.


Figure 4.3  Number of satellites by width of female crab.

Figure 4.3 plots the response counts of satellites against width, with numerical symbols
indicating the number of observations at each point. The substantial variability makes it
difficult to discern a clear trend. To get a clearer picture, we grouped the female crabs into width
categories (<23.25, 23.25-24.25, 24.25-25.25, 25.25-26.25, 26.25-27.25, 27.25-28.25,
28.25-29.25, >29.25) and calculated the sample mean number of satellites for female
crabs in each category. Figure 4.4 plots these sample means against the sample mean width
for crabs in each category.

More sophisticated ways of portraying the trend smooth the data without grouping the
width values or assuming a particular functional relationship. Figure 4.4 also shows a
smoothed curve based on a semiparametric extension of the GLM (the generalized additive
model) presented in Section 7.4.9. The sample means and the smoothed curve both show
a strong increasing trend. (The means tend to fall above the curve, since the response
counts in a category tend to be skewed to the right; the smoothed curve is less susceptible
to outlying observations.) The trend seems approximately linear, and we next discuss
models for the ungrouped data for which the mean or the log of the mean is linear in width.

For a female crab, let $\mu(x)$ be the expected number of satellites at width x. From GLM
software as shown in the Appendix at the text website, the ML fit of the Poisson loglinear
model (4.11) is

$$\log \hat{\mu}(x) = \hat{\alpha} + \hat{\beta}x = -3.305 + 0.164x.$$

Figure 4.4  Smoothings of horseshoe crab counts.

The effect $\hat{\beta} = 0.164$ of width is positive, with SE = 0.020. The model fitted value at a
width level x is an estimated mean number of satellites $\hat{\mu}(x)$. For instance, the fitted value
at the mean width of x = 26.3 is

$$\hat{\mu}(x) = \exp(\hat{\alpha} + \hat{\beta}x) = \exp[-3.305 + 0.164(26.3)] = 2.74.$$

For this model, $\exp(\hat{\beta}) = \exp(0.164) = 1.18$ is the multiplicative effect on $\hat{\mu}(x)$ for a
1-cm increase in x. For instance, the fitted value at x = 27.3 = 26.3 + 1 is $\exp[-3.305 +
0.164(27.3)] = 3.23$, which equals 1.18 × 2.74. A 1-cm increase in width yields an 18%
increase in the estimated mean.
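
The arithmetic in this paragraph is easy to verify. The Python sketch below plugs the reported estimates into the fitted loglinear equation; the function name `fitted_mean` is hypothetical.

```python
import math

alpha_hat, beta_hat = -3.305, 0.164     # reported ML estimates for the log-link model

def fitted_mean(width):
    """Estimated mean number of satellites at a given carapace width (cm)."""
    return math.exp(alpha_hat + beta_hat * width)

at_mean_width = fitted_mean(26.3)
one_cm_wider = fitted_mean(27.3)
ratio = one_cm_wider / at_mean_width     # equals exp(beta_hat) at any width

print(f"fitted mean at 26.3 cm: {at_mean_width:.2f}")
print(f"fitted mean at 27.3 cm: {one_cm_wider:.2f}")
print(f"multiplicative effect per cm: {ratio:.2f}")
```

The ratio of fitted means one centimeter apart does not depend on the starting width, which is what the multiplicative interpretation of $\exp(\hat{\beta})$ asserts.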

Figure 4.4 shows that $\mu(x)$ may grow approximately linearly with width. This suggests
the Poisson GLM with identity link. It has ML fit

$$\hat{\mu}(x) = \hat{\alpha} + \hat{\beta}x = -11.53 + 0.55x.$$

This model has an additive rather than a multiplicative effect of x: A 1-cm increase in x has
an estimated increase of $\hat{\beta} = 0.55$ in $\hat{\mu}(x)$. The fitted values are positive at all sampled x,
and the model describes the effect simply: On the average, about a 2-cm increase in width
is associated with an extra satellite.
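
To compare the two fitted equations numerically, the Python sketch below tabulates the log-link and identity-link fitted means at a few widths. The widths are illustrative choices within the plotted range, and the coefficients are the rounded estimates reported above, so the output is approximate.

```python
import math

def log_link_fit(width):
    """Fitted mean from the Poisson loglinear model."""
    return math.exp(-3.305 + 0.164 * width)

def identity_link_fit(width):
    """Fitted mean from the Poisson GLM with identity link."""
    return -11.53 + 0.55 * width

for width in (21.0, 23.0, 25.0, 26.3, 28.0, 30.0):
    print(f"width = {width:4.1f} cm  log link: {log_link_fit(width):5.2f}  "
          f"identity link: {identity_link_fit(width):5.2f}")

# Width increase associated with one extra satellite under the identity link.
cm_per_satellite = 1.0 / 0.55
print(f"about {cm_per_satellite:.1f} cm per extra satellite")
```

At 21 cm the identity-link fitted mean is barely above zero, which shows how close this model comes to an inadmissible negative mean at small widths.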

Figure 4.5 plots $\hat{\mu}(x)$ against width for the models with log link and identity link.
Although they diverge somewhat for relatively small and large widths, they provide similar
predictions over the width range in which most observations occur. We now study whether
either model fits adequately.


4.3.3  Overdispersion for Poisson GLMs

In  Section  1.2.4  we  noted  that  count  data  often  show  greater  variability  than  the  Poisson 
allows.  For  the  grouped  horseshoe  crab  data,  Table  4.4  shows  the  sample  mean  and  variance 
for  the  counts  of  number  of  satellites  for  the  female  crabs  in  each  width  category.  The 
variances  are  much  larger  than  the  means,  whereas  Poisson  distributions  have  identical 
mean  and  variance.  The  greater  variability  than  predicted  by  the  GLM  random  component 
reflects  overdispersion. 

Table 4.4  Sample Mean and Variance of Number of Satellites

               Number of   Number of    Sample    Sample
Width (cm)       Cases     Satellites    Mean     Variance
<23.25             14          14        1.00       2.77
23.25-24.25        14          20        1.43       8.88
24.25-25.25        28          67        2.39       6.54
25.25-26.25        39         105        2.69      11.38
26.25-27.25        22          63        2.86       6.88
27.25-28.25        24          93        3.87       8.81
28.25-29.25        18          71        3.94      16.88
>29.25             14          72        5.14       8.29


A common cause of overdispersion is subject heterogeneity. For instance, suppose that
width, weight, color, and spine condition are the four predictors that affect a female crab’s
number of satellites. Suppose that Y has a Poisson distribution at each fixed combination
of those predictors. Our model uses width alone as a predictor. Crabs having a certain
width are then a mixture of crabs of various weights, colors, and spine conditions. Thus,
the population of crabs having that width is a mixture of several Poisson populations, each
having its own mean for the response. This heterogeneity results in an overall response
distribution at that width having greater variation than the Poisson predicts. If the variance
equals the mean when all relevant variables are controlled, it exceeds the mean when only
one is controlled.
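
The last statement follows from the law of total variance. As a sketch, let $\mu$ here denote the Poisson mean of a randomly selected crab at a given width, regarded as random because it varies with the unmeasured predictors (this use of $\mu$ as a random quantity is an assumption of the sketch). Conditional on $\mu$, Y is Poisson with $E(Y \mid \mu) = \mathrm{var}(Y \mid \mu) = \mu$, so

$$E(Y) = E[E(Y \mid \mu)] = E(\mu), \qquad \mathrm{var}(Y) = E[\mathrm{var}(Y \mid \mu)] + \mathrm{var}[E(Y \mid \mu)] = E(\mu) + \mathrm{var}(\mu) \ge E(Y),$$

with strict inequality whenever $\mu$ truly varies among crabs of that width.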

Overdispersion is not an issue in ordinary regression with normally distributed Y, because
that  distribution  has  a  separate  variance  parameter  to  describe  variability.  For  binomial  and 
Poisson  distributions,  however,  the  variance  is  a  function  of  the  mean.  Overdispersion  is 
common  in  the  modeling  of  counts.  When  the  model  for  the  mean  is  correct  but  the  true 
distribution  is  not  Poisson,  the  ML  estimates  of  model  parameters  are  still  consistent  but 
standard  errors  are  incorrect.  We  next  introduce  an  extension  of  the  Poisson  GLM  that 
has  an  extra  parameter  and  accounts  better  for  overdispersion.  In  Section  4.7  we  present 
another  approach  for  this,  quasi-likelihood  inference. 
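
The extent of overdispersion in the grouped data can be summarized by the ratio of sample variance to sample mean in each width category, which would be near 1 under a Poisson distribution. The Python sketch below computes these ratios from the values listed in Table 4.4.

```python
# (width category, number of cases, sample mean, sample variance) from Table 4.4
table_4_4 = [
    ("<23.25", 14, 1.00, 2.77),
    ("23.25-24.25", 14, 1.43, 8.88),
    ("24.25-25.25", 28, 2.39, 6.54),
    ("25.25-26.25", 39, 2.69, 11.38),
    ("26.25-27.25", 22, 2.86, 6.88),
    ("27.25-28.25", 24, 3.87, 8.81),
    ("28.25-29.25", 18, 3.94, 16.88),
    (">29.25", 14, 5.14, 8.29),
]

# Variance-to-mean ratio in each category; the Poisson value is 1.
ratios = {label: variance / mean for label, _, mean, variance in table_4_4}
total_cases = sum(cases for _, cases, _, _ in table_4_4)

for label, ratio in ratios.items():
    print(f"{label:12s} variance/mean = {ratio:.2f}")
print(f"total cases = {total_cases}")
```

Every ratio exceeds 1, in most categories by a factor of 2 or more.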


4.3.4  Negative  Binomial  GLMs 

The negative binomial distribution has probability mass function

$$f(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k)\Gamma(y + 1)} \left(\frac{k}{\mu + k}\right)^{k} \left(1 - \frac{k}{\mu + k}\right)^{y}, \quad y = 0, 1, 2, \ldots, \tag{4.13}$$

where $k > 0$ and $\mu > 0$ are parameters. This distribution results when, given the mean,
Y has a Poisson distribution, but the mean itself varies according to a gamma distribution
with shape parameter k (Section 14.4).

Notationally, we’ll find it simpler to parameterize the negative binomial distribution in
terms of $\mu$ and $\gamma = 1/k$. Then, Y has

$$E(Y) = \mu, \qquad \mathrm{var}(Y) = \mu + \gamma\mu^2.$$

The index $\gamma > 0$ is a type of dispersion parameter. As $\gamma \to 0$, $\mathrm{var}(Y) \to \mu$ and the negative
binomial distribution converges to the Poisson. Usually, $\gamma$ is unknown. Estimating it helps
summarize the extent of overdispersion. For $\gamma$ fixed, we can express (4.13) in natural
exponential family form (4.1). Then, a model with negative binomial random component
is a GLM. For simplicity, such models let $\gamma$ be the same constant for all observations but
treat it as unknown.
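
The mean and variance formulas can be checked directly from the probability mass function (4.13). The Python sketch below evaluates the pmf on the log scale with `math.lgamma` and sums over a long range of counts; the values $\mu = 2.74$ and $\gamma = 1.07$ are illustrative, and the truncation point of 2000 is an assumption that leaves a negligible tail for these values.

```python
import math

def neg_binomial_pmf(y, k, mu):
    """Probability mass function (4.13), computed on the log scale for stability."""
    p = k / (mu + k)
    log_f = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_f)

mu, gamma = 2.74, 1.07          # illustrative mean and dispersion parameter
k = 1.0 / gamma

probs = [neg_binomial_pmf(y, k, mu) for y in range(2000)]
total = sum(probs)
mean = sum(y * f for y, f in enumerate(probs))
variance = sum((y - mean) ** 2 * f for y, f in enumerate(probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.4f} (mu = {mu})")
print(f"variance = {variance:.4f} (mu + gamma*mu^2 = {mu + gamma * mu ** 2:.4f})")
```

For these values the variance is roughly four times the mean, whereas a Poisson distribution with the same mean would have variance 2.74.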

In Section 14.4 we present more detail about negative binomial GLMs. We illustrate the
model here for the horseshoe crab data analyzed above with Poisson GLMs. With the identity
link and width as predictor, the Poisson GLM has $\hat{\mu} = -11.53 + 0.55x$ (SE = 0.06 for $\hat{\beta}$).
For the negative binomial GLM, convergence problems are caused by a slightly negative
predicted response during the iterative fitting process at the lowest observed width value of
21 cm. Without that observation, the ML fit is $\hat{\mu} = -11.47 + 0.55x$ (SE = 0.12). Moreover,
$\hat{\gamma} = 1.07$, so at a predicted $\hat{\mu}$, the estimated var(Y) is roughly $\hat{\mu} + \hat{\mu}^2$, compared to $\hat{\mu}$ for
the Poisson GLM. (The fit is similar to that of the geometric distribution, which is the
special case of the negative binomial with $\gamma = 1.0$.) Although fitted values are similar to
the Poisson GLM, the greater SE for $\hat{\beta}$ and the greater estimated var(Y) in the negative
binomial model reflect the overdispersion uncaptured with the Poisson GLM. Further
improved fit is obtained by allowing “zero-inflation,” permitting a higher fitted count at 0
than the negative binomial model allows.


4.3.5  Poisson Regression for Rates Using Offsets

Often a response count $Y_i$ has an index $t_i$ such that its expected value is proportional to $t_i$.
For instance, this index could be an amount of time or a spatial area over which the count is
made. Then, the sample rate is $y_i/t_i$, with expected value $\mu_i/t_i$. With explanatory variables
x, a loglinear model for the expected rate has form

$$\log(\mu_i/t_i) = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \tag{4.14}$$

This model has equivalent representation

$$\log \mu_i - \log t_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.$$

The adjustment term, $-\log t_i$, to the log link of the mean is called an offset. The fit
corresponds to using $\log t_i$ as a predictor on the right-hand side and forcing its coefficient
to equal 1.0.

For model (4.14), the expected response count satisfies

$$\mu_i = t_i \exp(\alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}).$$

The mean is proportional to $t_i$, with proportionality constant depending on the values of
the explanatory variables. The identity link is also sometimes useful. The model is then

$$\mu_i/t_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad \text{or} \quad \mu_i = \alpha t_i + \beta_1 x_{i1} t_i + \cdots + \beta_p x_{ip} t_i.$$

This does not require an offset. It corresponds to an ordinary Poisson GLM using identity
link with no intercept and with explanatory variables $t_i, x_{i1}t_i, \ldots, x_{ip}t_i$. It provides additive,
rather than multiplicative, predictor effects. It is less useful with several predictors, as the
fitting process may fail because of a negative fitted count at an $x_i$ at some step in the
iterative process.

4.3.6  Example:  Modeling  Death  Rates  for  Heart  Valve  Operations 

Laird  and  Olivier  (1981)  analyzed  patient  survival  after  heart  valve  replacement  operations. 
A sample of 109 patients was classified by type of heart valve (aortic, mitral) and by age
(<55, 55+). Follow-up observations occurred until the patient died or the study ended.
Operations  occurred  throughout  the  study  period,  and  follow-up  observations  covered 
lengths  of  time  varying  from  3  to  97  months.  The  response  was  whether  the  subject  died 
and  the  follow-up  time.  For  subjects  who  died,  this  is  the  time  after  the  operation  until 
death;  for  the  others,  it  is  the  time  until  the  study  ended  or  the  subject  withdrew  from  it. 

Table 4.5 lists the numbers of deaths during the follow-up period, by valve type and age.
These counts are the first layer of a three-way contingency table that classifies valve type,
age, and whether died (yes, no). The subjects not tabulated in Table 4.5 were not observed
to die. They are censored, since we know only a lower bound for how long they lived after
the operation. It is inappropriate to analyze that 2 × 2 × 2 table using binary GLMs for
the probability of death, since subjects had differing times at risk; it is not sensible to treat
a subject who could be observed for 3 months and a subject who could be observed for
97 months as identical trials with the same probability. To use age and valve type as
predictors in a model for frequency of death, the proper baseline is not the number of
subjects but rather the total time that subjects were at risk. Thus, we model the rate of
death.


Table 4.5  Data on Heart Valve Replacement Operations

                                  Type of Heart Valve
Age                                Aortic     Mitral
<55    Number of deaths               4          1
       Time at risk (months)       1259       2082
       Death rate                 0.0032     0.0005
55+    Number of deaths               7          9
       Time at risk (months)       1417       1647
       Death rate                 0.0049     0.0055

Source: Reprinted with permission, based on data in Laird and Olivier (1981).

The  time  at  risk  for  a  subject  is  their  follow-up  time  of  observation.  For  a  given  age  and 
valve  type,  the  total  time  at  risk  is  the  sum  of  the  times  at  risk  for  all  subjects  in  that  cell 
(those  who  died  and  those  censored).  The  sample  rate,  also  shown  in  that  table,  divides  the 
number  of  deaths  by  total  time  at  risk,  in  months.  For  instance,  4  deaths  in  1259  months  of 
observation  occurred  for  younger  subjects  with  aortic  valve  replacement,  so  their  sample 
rate is 4/1259 = 0.0032.

We now model effects of age and valve type on the rate. Let $Y_{ij}$ denote the number of
deaths for age $a_i$ and valve type $v_j$, with expected value $\mu_{ij}$ for total time at risk $t_{ij}$. Given
$t_{ij}$, the expected rate is $\mu_{ij}/t_{ij}$. Let $a$ be an indicator variable for age, with $a_1 = 0$ for the
younger age group and $a_2 = 1$ for the older group. Let $v$ be an indicator variable for valve
type, with $v_1 = 0$ for aortic and $v_2 = 1$ for mitral. The model

$$\log(\mu_{ij}/t_{ij}) = \alpha + \beta_1 a_i + \beta_2 v_j \tag{4.15}$$

assumes a lack of interaction in the effects of age and valve type.

Using software (as shown at the text website), we treat $\{Y_{ij}\}$ as independent Poisson
variates with means $\{\mu_{ij}\}$, conditional on $\{t_{ij}\}$. Table 4.6 presents the fitted death counts and
estimated rates. The estimated effects are

$$\hat{\beta}_1 = 1.221 \ (SE = 0.514), \qquad \hat{\beta}_2 = -0.330 \ (SE = 0.438).$$


Table 4.6 Fit of Poisson Regression Models to Table 4.5 on Heart Valve Operation Deaths

                              Log Link            Identity Link
Age    Measure             Aortic   Mitral      Aortic   Mitral
<55    Number of deaths      2.28     2.72        3.16     1.19
       Death rate          0.0018   0.0013      0.0025   0.0006
55+    Number of deaths      8.72     7.28        9.17     7.48
       Death rate          0.0062   0.0044      0.0065   0.0046

There is evidence of an age effect. Given valve type, the estimated death rate for the older
age group is exp(1.221) = 3.39 times that for the younger age group. The study contains
much censored data. Of the 109 patients, only 21 died during the study period, so both
effect estimates are imprecise. Note, though, that the analysis uses all 109 patients through
their contributions to the times at risk.
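The log-link fit can be reproduced with a few lines of code. The following is a minimal sketch, not the software used at the text website: it assumes NumPy is available, treats log(time at risk) as an offset, and solves the likelihood equations by Newton-Raphson. The function name `fit_poisson_rate` and the starting value are illustrative choices.

```python
import numpy as np

# Data from Table 4.5, ordered (<55, aortic), (<55, mitral), (55+, aortic), (55+, mitral).
deaths = np.array([4.0, 1.0, 7.0, 9.0])
months = np.array([1259.0, 2082.0, 1417.0, 1647.0])
age = np.array([0.0, 0.0, 1.0, 1.0])    # a = 1 for the older age group
valve = np.array([0.0, 1.0, 0.0, 1.0])  # v = 1 for mitral

X = np.column_stack([np.ones(4), age, valve])  # model matrix for (alpha, beta1, beta2)
offset = np.log(months)                        # log t_ij enters with coefficient 1


def fit_poisson_rate(X, y, offset, n_iter=50, tol=1e-10):
    """Newton-Raphson for the Poisson model log(mu/t) = X beta (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / np.exp(offset).sum())  # start at the overall death rate
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        score = X.T @ (y - mu)            # derivative of the log likelihood
        info = X.T @ (mu[:, None] * X)    # information matrix X' diag(mu) X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta + offset)
    cov = np.linalg.inv(X.T @ (mu[:, None] * X))
    return beta, np.sqrt(np.diag(cov)), mu


beta_hat, se, fitted = fit_poisson_rate(X, deaths, offset)
print("effects:", np.round(beta_hat[1:], 3), "SE:", np.round(se[1:], 3))
print("fitted deaths:", np.round(fitted, 2))
print("fitted rates:", np.round(fitted / months, 4))
```

The printed effects, standard errors, and fitted values can be compared with the estimates quoted above and with the log-link columns of Table 4.6.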

Table 4.6 also shows the fit of the corresponding model with identity link,

$$\mu_{ij} = \alpha t_{ij} + \beta_1 a_i t_{ij} + \beta_2 v_j t_{ij}.$$

Substantive conclusions are similar. The estimate $\hat{\beta}_1 = 0.0040$ (SE = 0.0014) represents
the estimated difference in death rates between the older and younger age groups for each
valve type.

4.3.7  Poisson  GLM  of  Independence  in  Two-Way  Contingency  Tables 

Poisson loglinear models are also used to model counts in ordinary contingency tables.
We illustrate for two-way tables with independent counts $\{Y_{ij}\}$ having Poisson distributions
with means $\{\mu_{ij}\}$. Suppose that $\{\mu_{ij}\}$ satisfy

$$\mu_{ij} = \mu \alpha_i \beta_j,$$

where $\{\alpha_i\}$ and $\{\beta_j\}$ are positive constants satisfying $\sum_i \alpha_i = \sum_j \beta_j = 1$. This is a
multiplicative model, but a linear predictor for a GLM results using the log link,

$$\log \mu_{ij} = \lambda + \alpha_i^* + \beta_j^*, \tag{4.16}$$

where $\lambda = \log \mu$, $\alpha_i^* = \log \alpha_i$, $\beta_j^* = \log \beta_j$. This Poisson loglinear model has additive main
effects of the two classifications but no interaction.

Since the $\{Y_{ij}\}$ are independent, the total sample size $\sum_i \sum_j Y_{ij}$ has a Poisson
distribution with mean $\sum_i \sum_j \mu_{ij} = \mu$. Conditional on $\sum_i \sum_j Y_{ij} = n$, the cell counts have a
multinomial distribution with probabilities $\{\pi_{ij} = \mu_{ij}/\mu = \alpha_i \beta_j = \pi_{i+}\pi_{+j}\}$. This is
independence between the two categorical variables. In fact, in Poisson form, independence
is the loglinear model (4.16). The inferences conducted in Chapter 3 about independence
in two-way contingency tables relate to GLMs, either Poisson loglinear models or
corresponding multinomial models that fix $n$ or the row or column totals.
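As a small numerical illustration of (4.16), the sketch below takes a hypothetical 2 x 3 table (the counts are made up for this example), forms the familiar independence fitted values (row total times column total divided by $n$), and checks that their logarithms have the additive form $\lambda + \alpha_i^* + \beta_j^*$. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical 2 x 3 table of counts (illustrative only, not from the text).
counts = np.array([[10.0, 20.0, 30.0],
                   [20.0, 20.0, 20.0]])

n = counts.sum()
alpha = counts.sum(axis=1) / n   # estimated row probabilities, sum to 1
beta = counts.sum(axis=0) / n    # estimated column probabilities, sum to 1

# Multiplicative form mu_ij = mu * alpha_i * beta_j with mu estimated by n.
fitted = n * np.outer(alpha, beta)

# Additive form on the log scale: lambda + alpha*_i + beta*_j.
lam = np.log(n)
log_mu = lam + np.log(alpha)[:, None] + np.log(beta)[None, :]

print(fitted)
print(np.allclose(np.exp(log_mu), fitted))
```

The fitted values reproduce the observed row and column totals, which is what the likelihood equations for this loglinear model require.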


4.4  MOMENTS  AND  LIKELIHOOD  FOR  GENERALIZED  LINEAR  MODELS 

Having  introduced  GLMs  for  binary  and  count  data,  we  now  turn  our  attention  to  the 
likelihood  equations  and  methods  for  fitting  GLMs.  The  remainder  of  this  chapter  is 
somewhat  technical,  providing  general  results  applying  to  the  modeling  methods  presented 
in  subsequent  chapters.  Some  readers  may  prefer  to  skip  this  material. 


4.4.1  The  Exponential  Dispersion  Family 

It is helpful to extend the notation for a GLM to handle many distributions that have a
second parameter. The random component of the GLM specifies that the $N$ observations
$(y_1, \ldots, y_N)$ on $Y$ are independent, with probability mass or density function for $y_i$ of form

$$f(y_i; \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi)\}. \tag{4.17}$$

This is called the exponential dispersion family and $\phi$ is called the dispersion parameter
(Jørgensen 1987). The parameter $\theta_i$ is the natural parameter.

When $\phi$ is known, (4.17) simplifies to the form (4.1) for the natural exponential family,
which is

$$f(y_i; \theta_i) = a(\theta_i)\, b(y_i) \exp[y_i Q(\theta_i)].$$

We identify $Q(\theta)$ here with $\theta/a(\phi)$ in (4.17), $a(\theta)$ with $\exp[-b(\theta)/a(\phi)]$ in (4.17), and
$b(y)$ with $\exp[c(y, \phi)]$ in (4.17). The more general formula (4.17) is not needed for
one-parameter families such as the binomial and Poisson. Usually, $a(\phi)$ has form $a(\phi) = \phi/\omega_i$
for a known weight $\omega_i$. For instance, when $y_i$ is a mean of $n_i$ independent readings, such
as a sample proportion for $n_i$ Bernoulli trials, $\phi = 1$ and $\omega_i = n_i$ (Section 4.4.3).

4.4.2  Mean  and  Variance  Functions  for  the  Random  Component 

Expressions for $E(Y_i)$ and $\mathrm{var}(Y_i)$ use terms in (4.17). Let $L_i = \log f(y_i; \theta_i, \phi)$ denote the
contribution of $y_i$ to the log likelihood, so the log-likelihood function is $L = \sum_i L_i$. From
(4.17),

$$L_i = [y_i\theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi). \tag{4.18}$$

Therefore,

$$\partial L_i/\partial\theta_i = [y_i - b'(\theta_i)]/a(\phi), \qquad \partial^2 L_i/\partial\theta_i^2 = -b''(\theta_i)/a(\phi),$$

where $b'(\theta_i)$ and $b''(\theta_i)$ denote the first two derivatives of $b(\cdot)$ evaluated at $\theta_i$. We now apply
the general likelihood results

$$E\left(\frac{\partial L}{\partial\theta}\right) = 0 \quad \text{and} \quad -E\left(\frac{\partial^2 L}{\partial\theta^2}\right) = E\left(\frac{\partial L}{\partial\theta}\right)^2,$$

which hold under regularity conditions satisfied by the exponential family (Cox and
Hinkley 1974, Sec. 4.8). From the first formula applied with a single observation,
$E[Y_i - b'(\theta_i)]/a(\phi) = 0$, or

$$\mu_i = E(Y_i) = b'(\theta_i). \tag{4.19}$$

From the second formula,

$$b''(\theta_i)/a(\phi) = E[(Y_i - b'(\theta_i))/a(\phi)]^2 = \mathrm{var}(Y_i)/[a(\phi)]^2,$$

so that

$$\mathrm{var}(Y_i) = b''(\theta_i)\,a(\phi). \tag{4.20}$$

In summary, the function $b(\cdot)$ in (4.17) determines moments of $Y_i$. This function is
called the cumulant function, since when $a(\phi) = 1$ its derivatives yield the cumulants of
the distribution (Jørgensen 1987).

4.4.3  Mean  and  Variance  Functions  for  Poisson  and  Binomial  GLMs 

We illustrate the mean and variance expressions for Poisson and binomial distributions.
When $Y_i$ is Poisson,

$$f(y_i; \mu_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \exp(y_i \log\mu_i - \mu_i - \log y_i!) = \exp[y_i\theta_i - \exp(\theta_i) - \log y_i!],$$

where $\theta_i = \log\mu_i$. This has exponential dispersion form (4.17) with $b(\theta_i) = \exp(\theta_i)$,
$a(\phi) = 1$, and $c(y_i, \phi) = -\log y_i!$. The natural parameter is $\theta_i = \log\mu_i$. From (4.19)
and (4.20),

$$E(Y_i) = b'(\theta_i) = \exp(\theta_i) = \mu_i,$$
$$\mathrm{var}(Y_i) = b''(\theta_i) = \exp(\theta_i) = \mu_i.$$

Next, suppose that $n_iY_i$ has a $\mathrm{bin}(n_i, \pi_i)$ distribution; that is, here $y_i$ is the sample
proportion (rather than number) of successes, so $E(Y_i) = \pi_i$ does not depend on $n_i$. Let $\theta_i =
\log[\pi_i/(1 - \pi_i)]$. Then, $\pi_i = \exp(\theta_i)/[1 + \exp(\theta_i)]$ and $\log(1 - \pi_i) = -\log[1 + \exp(\theta_i)]$.
Extending (4.3), we have the result

$$f(y_i; \pi_i, n_i) = \binom{n_i}{n_iy_i}\pi_i^{n_iy_i}(1 - \pi_i)^{n_i - n_iy_i}
= \exp\left\{\frac{y_i\theta_i - \log[1 + \exp(\theta_i)]}{1/n_i} + \log\binom{n_i}{n_iy_i}\right\}. \tag{4.21}$$

This has exponential dispersion form (4.17) with $b(\theta_i) = \log[1 + \exp(\theta_i)]$, $a(\phi) = 1/n_i$,
and $c(y_i, \phi) = \log\binom{n_i}{n_iy_i}$. The natural parameter is the logit, $\theta_i = \log[\pi_i/(1 - \pi_i)]$. From
(4.19) and (4.20),

$$E(Y_i) = b'(\theta_i) = \exp(\theta_i)/[1 + \exp(\theta_i)] = \pi_i,$$
$$\mathrm{var}(Y_i) = b''(\theta_i)\,a(\phi) = \exp(\theta_i)/\{[1 + \exp(\theta_i)]^2 n_i\} = \pi_i(1 - \pi_i)/n_i.$$
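The relations (4.19) and (4.20) can be checked numerically by differentiating the cumulant function. The sketch below uses central finite differences; the helper name `derivs` and the particular values of $\mu$, $\pi$, and $n$ are arbitrary choices made for illustration.

```python
import math


def derivs(b, theta, h=1e-4):
    """Central finite-difference approximations to b'(theta) and b''(theta)."""
    first = (b(theta + h) - b(theta - h)) / (2 * h)
    second = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2
    return first, second


# Poisson: b(theta) = exp(theta), a(phi) = 1, theta = log(mu).
mu = 3.5
d1, d2 = derivs(math.exp, math.log(mu))
poisson_mean, poisson_var = d1, d2 * 1.0

# Binomial proportion: b(theta) = log(1 + exp(theta)), a(phi) = 1/n, theta = logit(pi).
pi, n = 0.3, 20
theta = math.log(pi / (1 - pi))
d1, d2 = derivs(lambda t: math.log1p(math.exp(t)), theta)
binom_mean, binom_var = d1, d2 / n

print(poisson_mean, poisson_var)  # both should be close to mu
print(binom_mean, binom_var)      # close to pi and pi * (1 - pi) / n
```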


4.4.4  Systematic  Component  and  Link  Function  of  a  GLM 

For observation $i$, the systematic component of a GLM relates parameters $\{\eta_i\}$ to the
explanatory variables using a linear predictor

$$\eta_i = \sum_j \beta_j x_{ij}, \qquad i = 1, \ldots, N.$$

In matrix form,

$$\eta = X\beta,$$

where $\eta = (\eta_1, \ldots, \eta_N)^T$ and $\beta = (\beta_0, \beta_1, \beta_2, \ldots)^T$ is the column vector of model
parameters. With $p$ parameters in $\beta$, $X$ is the $N \times p$ matrix of explanatory variable values for the
$N$ subjects. In ordinary linear models, $X$ is called the design matrix. It need not refer to an
experimental design, however, and the GLM literature calls it the model matrix.

The GLM links $\eta_i$ to $\mu_i = E(Y_i)$ by a link function $g(\cdot)$. Thus, $\mu_i$ relates to the
explanatory variables by

$$\eta_i = g(\mu_i) = \sum_j \beta_j x_{ij}, \qquad i = 1, \ldots, N.$$

The link function $g$ for which $g(\mu_i) = \theta_i$ in (4.17) is the canonical link. For it, the direct
relationship

$$\theta_i = \sum_j \beta_j x_{ij}$$

occurs between the natural parameter and the linear predictor.

4.4.5  Likelihood  Equations  for  a  GLM 

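For $N$ independent observations, from (4.18) the log likelihood is

$$L(\beta) = \sum_i L_i = \sum_i \log f(y_i; \theta_i, \phi) = \sum_i \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + \sum_i c(y_i, \phi). \tag{4.22}$$

The notation $L(\beta)$ reflects the dependence of $\theta$ on the model parameters $\beta$.

The likelihood equations are

$$\partial L(\beta)/\partial\beta_j = \sum_i \partial L_i/\partial\beta_j = 0$$

for all $j$. To differentiate the log likelihood (4.22), we use the chain rule,

$$\frac{\partial L_i}{\partial\beta_j} = \frac{\partial L_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j}. \tag{4.23}$$

Since $\partial L_i/\partial\theta_i = [y_i - b'(\theta_i)]/a(\phi)$, and since $\mu_i = b'(\theta_i)$ and $\mathrm{var}(Y_i) = b''(\theta_i)a(\phi)$ from
(4.19) and (4.20),

$$\partial L_i/\partial\theta_i = (y_i - \mu_i)/a(\phi), \qquad \partial\mu_i/\partial\theta_i = b''(\theta_i) = \mathrm{var}(Y_i)/a(\phi).$$

Also, since $\eta_i = \sum_j \beta_j x_{ij}$,

$$\partial\eta_i/\partial\beta_j = x_{ij}.$$

Finally, since $\eta_i = g(\mu_i)$, $\partial\mu_i/\partial\eta_i$ depends on the link function for the model. In summary,
substituting into (4.23) gives us

$$\frac{\partial L_i}{\partial\beta_j} = \frac{y_i - \mu_i}{a(\phi)}\,\frac{a(\phi)}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij} = \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \tag{4.24}$$

Summing over the $N$ observations yields the likelihood equations,

$$\sum_{i=1}^{N} \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i} = 0, \qquad j = 0, 1, 2, \ldots. \tag{4.25}$$

Although $\beta$ does not appear in these equations, it is there implicitly through $\mu_i$, since
$\mu_i = g^{-1}(\sum_j \beta_j x_{ij})$. Different link functions yield different sets of equations.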
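To see how the link function changes the equations, the following sketch evaluates the left-hand side of (4.25) for a Poisson response under the log link and under the identity link. It assumes NumPy; the data, the trial value of $\beta$, and the function name `glm_score` are hypothetical and serve only to illustrate the formula.

```python
import numpy as np


def glm_score(X, y, mu, var, dmu_deta):
    """Left-hand side of (4.25): sum_i (y_i - mu_i) x_ij / var(Y_i) * dmu_i/deta_i for each j."""
    return X.T @ ((y - mu) / var * dmu_deta)


# Hypothetical data and an arbitrary trial value of beta (not the ML estimate).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 6.0, 7.0])
beta = np.array([0.5, 0.4])

# Poisson with log link: mu = exp(eta), dmu/deta = mu, var(Y) = mu.
mu_log = np.exp(X @ beta)
score_log = glm_score(X, y, mu_log, var=mu_log, dmu_deta=mu_log)

# Poisson with identity link: mu = eta, dmu/deta = 1, var(Y) = mu.
mu_id = X @ beta
score_id = glm_score(X, y, mu_id, var=mu_id, dmu_deta=np.ones_like(mu_id))

print(score_log)  # equals X'(y - mu) for the log link
print(score_id)   # equals X'[(y - mu)/mu] for the identity link
```

Setting either vector to zero and solving for $\beta$ gives the ML estimates for that link.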


4.4.6  The  Key  Role  of  the  Mean- Variance  Relationship 

Interestingly, the likelihood equations (4.25) depend on the distribution of $Y_i$ only through
$\mu_i$ and $\mathrm{var}(Y_i)$. The variance itself depends on the mean through a particular functional
form

$$\mathrm{var}(Y_i) = v(\mu_i)$$

for some function $v$. For example, $v(\mu_i) = \mu_i$ for the Poisson, $v(\mu_i) = \mu_i(1 - \mu_i)/n_i$ for
the binomial proportion, and $v(\mu_i) = \sigma^2$ (i.e., constant) for the normal.

When $Y_i$ has distribution in the natural exponential family, the relationship between the
mean and the variance characterizes the distribution (Jørgensen 1987). For instance, if $Y_i$
has distribution in the natural exponential family and if $v(\mu_i) = \mu_i$, then necessarily $Y_i$ has
the Poisson distribution.


4.4.7  Likelihood  Equations  for  Binomial  GLMs 

Suppose that $n_iY_i$ has a $\mathrm{bin}(n_i, \pi_i)$ distribution. We use the binomial parameterization of
Section 4.4.3, so $y_i$ is a sample proportion of successes for $n_i$ trials. The binomial GLM
(4.8) for a single predictor extends with several predictors to

$$\pi_i = \Phi\Big(\sum_j \beta_j x_{ij}\Big), \tag{4.26}$$

where $\Phi$ is the standard cdf of some class of continuous distributions. Since $\pi_i = \mu_i =
\Phi(\eta_i)$ with $\eta_i = \sum_j \beta_j x_{ij}$,

$$\frac{\partial\mu_i}{\partial\eta_i} = \phi(\eta_i) = \phi\Big(\sum_j \beta_j x_{ij}\Big),$$

where $\phi(u) = \partial\Phi(u)/\partial u$ [i.e., the pdf corresponding to the cdf $\Phi$, not the
dispersion parameter in (4.17)]. Since $\mathrm{var}(Y_i) = \pi_i(1 - \pi_i)/n_i$, the likelihood equations (4.25)
simplify to

$$\sum_i \frac{n_i(y_i - \pi_i)x_{ij}}{\pi_i(1 - \pi_i)}\,\phi(\eta_i) = 0, \tag{4.27}$$

where $\pi_i = \Phi(\sum_j \beta_j x_{ij})$.

For the logit link, $\eta_i = \log[\pi_i/(1 - \pi_i)]$, so $\partial\eta_i/\partial\pi_i = 1/[\pi_i(1 - \pi_i)]$ and $\partial\mu_i/\partial\eta_i =
\partial\pi_i/\partial\eta_i = \pi_i(1 - \pi_i)$. Then the likelihood equations (4.27) simplify to

$$\sum_i n_i(y_i - \pi_i)x_{ij} = 0, \tag{4.28}$$

where $\pi_i = \Phi(\sum_j \beta_j x_{ij})$ with $\Phi$ as the standard logistic cdf.


4.4.8  Asymptotic  Covariance  Matrix  of  Model  Parameter  Estimators 

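The likelihood function for the GLM also determines the asymptotic covariance matrix of
the ML estimator $\hat{\beta}$. This matrix is the inverse of the information matrix $\mathcal{J}$, which has
elements $E[-\partial^2 L(\beta)/\partial\beta_h\,\partial\beta_j]$. To find this, for the contribution $L_i$ to the log likelihood
we use the helpful result

$$E\left(-\frac{\partial^2 L_i}{\partial\beta_h\,\partial\beta_j}\right) = E\left[\left(\frac{\partial L_i}{\partial\beta_h}\right)\left(\frac{\partial L_i}{\partial\beta_j}\right)\right],$$

which holds for exponential families (Cox and Hinkley 1974, Sec. 4.8). Thus,
from (4.24),

$$E\left(-\frac{\partial^2 L_i}{\partial\beta_h\,\partial\beta_j}\right) = E\left[\frac{(Y_i - \mu_i)x_{ih}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\cdot\frac{(Y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\right] = \frac{x_{ih}x_{ij}}{\mathrm{var}(Y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2.$$

Since $L(\beta) = \sum_i L_i$,

$$E\left(-\frac{\partial^2 L(\beta)}{\partial\beta_h\,\partial\beta_j}\right) = \sum_{i=1}^{N} \frac{x_{ih}x_{ij}}{\mathrm{var}(Y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2.$$

Let $W$ be the diagonal matrix with main-diagonal elements

$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i). \tag{4.29}$$

Then, generalizing from the typical element of the information matrix to the entire matrix,

$$\mathcal{J} = X^TWX. \tag{4.30}$$

Note that the form of $W$ and hence $\mathcal{J}$ depends on the link function.

The asymptotic covariance matrix of $\hat{\beta}$ is estimated by

$$\widehat{\mathrm{cov}}(\hat{\beta}) = \hat{\mathcal{J}}^{-1} = (X^T\hat{W}X)^{-1}, \tag{4.31}$$

where $\hat{W}$ is $W$ evaluated at $\hat{\beta}$. We'll see an example for Poisson GLMs next and for
binomial GLMs in Section 5.5.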

4.4.9 Likelihood Equations and cov($\hat{\beta}$) for Poisson Loglinear Model

The general Poisson loglinear model (4.4) has the matrix form

$$\log\mu = X\beta.$$

For the log link, $\eta_i = \log\mu_i$, so $\mu_i = \exp(\eta_i)$ and $\partial\mu_i/\partial\eta_i = \exp(\eta_i) = \mu_i$. Since
$\mathrm{var}(Y_i) = \mu_i$, the likelihood equations (4.25) simplify to

$$\sum_i (y_i - \mu_i)x_{ij} = 0. \tag{4.32}$$

These equate the sufficient statistics $\sum_i y_ix_{ij}$ for $\beta$ to their expected values. Also, since

$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i) = \mu_i,$$

the estimated covariance matrix (4.31) of $\hat{\beta}$ is $(X^T\hat{W}X)^{-1}$, where $\hat{W}$ is the diagonal matrix
with elements of $\hat{\mu}$ on the main diagonal.
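A case with a closed-form answer makes these formulas concrete. In the sketch below (hypothetical counts, NumPy assumed), the model has an intercept and one binary indicator, so (4.32) forces the fitted means to equal the two group means, and $(X^T\hat{W}X)^{-1}$ gives the variance of the estimated log rate ratio as the sum of the reciprocals of the two group totals.

```python
import numpy as np

# Hypothetical counts for two groups (x = 0 and x = 1); illustrative only.
y = np.array([3.0, 5.0, 4.0, 9.0, 12.0, 6.0])
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

# With an intercept and a binary indicator, (4.32) is solved by the group means.
mean0, mean1 = y[x == 0].mean(), y[x == 1].mean()
mu_hat = np.where(x == 1, mean1, mean0)
beta_hat = np.array([np.log(mean0), np.log(mean1 / mean0)])

score = X.T @ (y - mu_hat)               # likelihood equations (4.32); should be zero
W_hat = np.diag(mu_hat)                  # w_i = mu_i for the log link
cov_hat = np.linalg.inv(X.T @ W_hat @ X)
se_slope = np.sqrt(cov_hat[1, 1])

print(score)
print(se_slope, np.sqrt(1 / y[x == 0].sum() + 1 / y[x == 1].sum()))
```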


4.5  INFERENCE  AND  MODEL  CHECKING 
FOR  GENERALIZED  LINEAR  MODELS 

For most GLMs the likelihood equations (4.25) are nonlinear functions of $\beta$. For now, we
defer details about solving them for the ML estimator $\hat{\beta}$ and focus instead on using the fit
for statistical inference.

The  Wald,  score,  and  likelihood-ratio  methods  introduced  in  Section  1.3.3  for  signifi¬ 
cance  testing  and  interval  estimation  apply  to  any  GLM.  Likelihood-ratio  inference  utilizes 
the  deviance  of  the  GLM. 


4.5.1  Deviance  and  Goodness  of  Fit 

From  Section  4.1.5,  the  saturated  GLM  has  a  separate  parameter  for  each  observation.  It 
gives  a  perfect  fit.  This  sounds  good,  but  it  is  not  a  helpful  model.  It  does  not  smooth  the 
data  or  have  the  advantages  that  a  simpler  model  has,  such  as  parsimony.  Nonetheless,  it 
serves  as  a  baseline  for  other  models,  such  as  for  checking  model  fit. 

A saturated model explains all variation by the systematic component of the model. Let $\tilde{\theta}$
denote the estimate of $\theta$ for the saturated model, corresponding to estimated means $\tilde{\mu}_i = y_i$
for all $i$. For a particular unsaturated model, denote the corresponding ML estimates by $\hat{\theta}$
and $\hat{\mu}_i$. For maximized log likelihood $L(\hat{\mu}; y)$ for that model and maximized log likelihood
$L(y; y)$ in the saturated case,

$$-2\log\left[\frac{\text{maximum likelihood for model}}{\text{maximum likelihood for saturated model}}\right] = -2[L(\hat{\mu}; y) - L(y; y)]$$

describes lack of fit. It is the likelihood-ratio statistic for testing the null hypothesis that the
model holds against the alternative that a more general model holds. From (4.22),

$$-2[L(\hat{\mu}; y) - L(y; y)] = 2\sum_i [y_i\tilde{\theta}_i - b(\tilde{\theta}_i)]/a(\phi) - 2\sum_i [y_i\hat{\theta}_i - b(\hat{\theta}_i)]/a(\phi).$$

When $a(\phi)$ in (4.17) has the form $a(\phi) = \phi/\omega_i$, this statistic equals

$$2\sum_i \omega_i[y_i(\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i)]/\phi = D(y; \hat{\mu})/\phi. \tag{4.33}$$

This is called the scaled deviance and $D(y; \hat{\mu})$ is the deviance. The greater the scaled
deviance, the poorer the fit. For some GLMs the scaled deviance has an approximate
chi-squared distribution.

4.5.2  Deviance  for  Poisson  GLMs 

For Poisson GLMs, by Section 4.4.3, $\hat{\theta}_i = \log\hat{\mu}_i$ and $b(\hat{\theta}_i) = \exp(\hat{\theta}_i) = \hat{\mu}_i$. Similarly,
$\tilde{\theta}_i = \log y_i$ and $b(\tilde{\theta}_i) = y_i$ for the saturated model. Also $a(\phi) = 1$, so the deviance and
scaled deviance (4.33) equal

$$D(y; \hat{\mu}) = 2\sum_i [y_i\log(y_i/\hat{\mu}_i) - y_i + \hat{\mu}_i]. \tag{4.34}$$

When a model with log link contains an intercept term, the likelihood equation (4.32)
implied by that parameter is $\sum_i y_i = \sum_i \hat{\mu}_i$. Then the deviance simplifies to

$$D(y; \hat{\mu}) = 2\sum_i y_i\log(y_i/\hat{\mu}_i). \tag{4.35}$$

For two-way contingency tables, substituting cell count $n_{ij}$ for $y_i$ and the independence
fitted value $\hat{\mu}_{ij}$ for $\hat{\mu}_i$, this reduces to the $G^2$ statistic (3.11) in Section 3.2.1. For a Poisson
or multinomial model applied to a contingency table with a fixed number of cells $N$,
Section 16.3 shows that the deviance has an approximate chi-squared distribution for
large $\{\mu_i\}$.
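As an illustration of (4.34) and (4.35), the sketch below computes the deviance of the log-link heart valve model from the observed deaths in Table 4.5 and the rounded fitted deaths in Table 4.6. It assumes NumPy; the function name `poisson_deviance` is an illustrative choice, and because the fitted values are rounded to two decimals the result is approximate.

```python
import numpy as np

y = np.array([4.0, 1.0, 7.0, 9.0])           # observed deaths, Table 4.5
mu_hat = np.array([2.28, 2.72, 8.72, 7.28])  # log-link fitted deaths, Table 4.6 (rounded)


def poisson_deviance(y, mu):
    """Deviance (4.34); a cell with y = 0 contributes 2 * mu."""
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - y + mu)


dev_full = poisson_deviance(y, mu_hat)             # form (4.34)
dev_short = 2.0 * np.sum(y * np.log(y / mu_hat))   # form (4.35), valid with an intercept

print(round(dev_full, 3), round(dev_short, 3))
```

The two forms agree here because the fitted values sum to the observed total of 21 deaths, as the likelihood equation for the intercept requires.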

4.5.3  Deviance  for  Binomial  GLMs:  Grouped  Versus  Ungrouped  Data 

Now consider binomial GLMs with sample proportions $\{y_i\}$ based on $\{n_i\}$ trials. By
Section 4.4.3, $\hat{\theta}_i = \log[\hat{\pi}_i/(1 - \hat{\pi}_i)]$ and $b(\hat{\theta}_i) = \log[1 + \exp(\hat{\theta}_i)] = -\log(1 - \hat{\pi}_i)$. Similarly,
$\tilde{\theta}_i = \log[y_i/(1 - y_i)]$ and $b(\tilde{\theta}_i) = -\log(1 - y_i)$ for the saturated model. Also, $a(\phi) =
1/n_i$, so $\phi = 1$ and $\omega_i = n_i$. The deviance (4.33) equals

$$2\sum_i n_i\left[y_i\left(\log\frac{y_i}{1 - y_i} - \log\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) + \log(1 - y_i) - \log(1 - \hat{\pi}_i)\right]$$

$$= 2\sum_i n_iy_i\log\frac{n_iy_i}{n_i - n_iy_i} - 2\sum_i n_iy_i\log\frac{n_i\hat{\pi}_i}{n_i - n_i\hat{\pi}_i} + 2\sum_i n_i\log\frac{1 - y_i}{1 - \hat{\pi}_i}$$

$$= 2\sum_i n_iy_i\log\frac{n_iy_i}{n_i\hat{\pi}_i} + 2\sum_i (n_i - n_iy_i)\log\frac{n_i - n_iy_i}{n_i - n_i\hat{\pi}_i}.$$

At setting $i$, $n_iy_i$ is the number of successes and $(n_i - n_iy_i)$ is the number of failures,
$i = 1, \ldots, N$. Thus, the deviance is a sum over the $2N$ cells of successes and failures and
has the same form,

$$D(y; \hat{\mu}) = 2\sum \text{observed} \times \log(\text{observed}/\text{fitted}), \tag{4.36}$$

as the deviance (4.35) for Poisson loglinear models with intercept term. With binomial
responses, it is possible to construct the data file as expressed here with the counts of
successes and failures at each setting for the predictors, or with the individual Bernoulli 0
or 1 observations at the subject level. The deviance differs in the two cases. In the first case
the saturated model has a parameter at each setting for the predictors, whereas in the second
case it has a parameter for each subject. We refer to these as grouped data and ungrouped
data cases. The approximate chi-squared distribution for the deviance occurs for grouped
data but not for ungrouped data (see Exercises 4.5, 4.18, and 5.35). With grouped data, the
sample size increases for a fixed number of settings of the predictors and hence a fixed
number of parameters for the saturated model.
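The difference between the grouped and ungrouped deviances is easy to see for the intercept-only model, whose fitted probability is the overall sample proportion under either data layout. The sketch below uses hypothetical counts at three predictor settings; the function names are illustrative.

```python
import math

# Hypothetical grouped data: trials and successes at three predictor settings.
trials = [10, 12, 8]
successes = [2, 6, 7]

# Intercept-only model: the ML fitted probability is the overall proportion.
pi_hat = sum(successes) / sum(trials)


def grouped_deviance(trials, successes, pi_hat):
    """Form (4.36): sum over the 2N success and failure cells."""
    total = 0.0
    for n, s in zip(trials, successes):
        f = n - s
        if s > 0:
            total += s * math.log(s / (n * pi_hat))
        if f > 0:
            total += f * math.log(f / (n * (1 - pi_hat)))
    return 2.0 * total


def ungrouped_deviance(trials, successes, pi_hat):
    """Bernoulli layout: the saturated model fits each 0/1 response exactly (log likelihood 0)."""
    s = sum(successes)
    f = sum(trials) - s
    return -2.0 * (s * math.log(pi_hat) + f * math.log(1 - pi_hat))


dev_grouped = grouped_deviance(trials, successes, pi_hat)
dev_ungrouped = ungrouped_deviance(trials, successes, pi_hat)
print(round(dev_grouped, 3), round(dev_ungrouped, 3))
```

The same model and the same fitted probability give two different deviances, because the saturated model used as the baseline differs between the two layouts.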


4.5.4  Likelihood-Ratio  Model  Comparison  Using  the  Deviances 

For a Poisson or binomial model denoted by $M$, $\phi = 1$, so the deviance (4.33) equals

$$D(y; \hat{\mu}) = -2[L(\hat{\mu}; y) - L(y; y)]. \tag{4.37}$$

Consider two models, $M_0$ with fitted values $\hat{\mu}_0$ and $M_1$ with fitted values $\hat{\mu}_1$, with $M_0$ a
special case of $M_1$. Model $M_0$ is said to be nested within $M_1$.

Since $M_0$ is simpler than $M_1$, a smaller set of parameter values satisfies $M_0$ than satisfies
$M_1$. Maximizing the log likelihood over a smaller space cannot yield a larger maximum.
Thus, $L(\hat{\mu}_0; y) \le L(\hat{\mu}_1; y)$, and it follows from (4.37) with the same $L(y; y)$ for each model
that

$$D(y; \hat{\mu}_1) \le D(y; \hat{\mu}_0).$$

Simpler models have larger deviances. Assuming that model $M_1$ holds, the likelihood-ratio
test of the hypothesis that $M_0$ holds uses the test statistic

$$-2[L(\hat{\mu}_0; y) - L(\hat{\mu}_1; y)]$$
$$= -2[L(\hat{\mu}_0; y) - L(y; y)] - \{-2[L(\hat{\mu}_1; y) - L(y; y)]\}$$
$$= D(y; \hat{\mu}_0) - D(y; \hat{\mu}_1).$$

The likelihood-ratio statistic comparing the two models is simply the difference between
the deviances. This statistic is large when $M_0$ fits poorly compared to $M_1$.

In fact, since the part in (4.33) involving the saturated model cancels, the difference
between deviances,

$$D(y; \hat{\mu}_0) - D(y; \hat{\mu}_1) = 2\sum_i \omega_i[y_i(\hat{\theta}_{1i} - \hat{\theta}_{0i}) - b(\hat{\theta}_{1i}) + b(\hat{\theta}_{0i})],$$

also has the form of the deviance. Under regularity conditions, this difference has
approximately a chi-squared null distribution with df equal to the difference between the numbers
of parameters in the two models.

For binomial GLMs and Poisson loglinear GLMs with intercept, from expressions (4.35)
and (4.36) for the deviance, the difference in deviances uses the observed counts and the
two sets of fitted values in the form

$$D(y; \hat{\mu}_0) - D(y; \hat{\mu}_1) = 2\sum_i y_i\log(\hat{\mu}_{1i}/\hat{\mu}_{0i}).$$

In fact, Simon (1973) showed that when observations have distribution in the natural
exponential family, this equals

$$D(y; \hat{\mu}_0) - D(y; \hat{\mu}_1) = 2\sum_i \hat{\mu}_{1i}\log(\hat{\mu}_{1i}/\hat{\mu}_{0i}) \tag{4.38}$$

for GLMs using the canonical link.³ In the rest of this text, we denote this likelihood-ratio
statistic for comparing models by $G^2(M_0 \mid M_1)$.

With binomial responses, the test comparing models is the same whether the data file
has grouped or ungrouped form. The saturated model differs in the two cases, but its log
likelihood cancels when we form the difference between the deviances.
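The cancellation can be verified directly. In the sketch below (hypothetical data; function and variable names are illustrative), $M_0$ has a single probability and $M_1$ has a separate probability at each predictor setting. The grouped log likelihood includes the binomial coefficients and the ungrouped one does not, yet the likelihood-ratio statistic is the same in both layouts.

```python
import math

# Hypothetical grouped data: trials and successes at three predictor settings.
trials = [10, 12, 8]
successes = [2, 6, 7]


def loglik(probs, trials, successes, grouped):
    """Binomial log likelihood; the grouped form adds the log binomial coefficients."""
    total = 0.0
    for n, s, p in zip(trials, successes, probs):
        total += s * math.log(p) + (n - s) * math.log(1 - p)
        if grouped:
            total += math.log(math.comb(n, s))
    return total


# ML fitted probabilities: M0 has one common probability, M1 one per setting.
p0 = [sum(successes) / sum(trials)] * len(trials)
p1 = [s / n for s, n in zip(successes, trials)]

lr_grouped = -2.0 * (loglik(p0, trials, successes, True) - loglik(p1, trials, successes, True))
lr_ungrouped = -2.0 * (loglik(p0, trials, successes, False) - loglik(p1, trials, successes, False))
print(round(lr_grouped, 4), round(lr_ungrouped, 4))
```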


³ This result, which follows from the simple form of the likelihood equations for such models, is shown for Poisson
loglinear models in Section 10.2.3.


4.5.5 Score Tests for Goodness of Fit and for Model Comparison

For the common GLMs having variance function $\mathrm{var}(Y_i) = v(\mu_i)$ with $\phi = 1$, the score
statistic for testing the model fit has the generalized Pearson form (Lovison 2005,
Smyth 2003)

$$X^2 = \sum_i \frac{(y_i - \hat{\mu}_i)^2}{v(\hat{\mu}_i)}. \tag{4.39}$$

For Poisson $y_i$, for which $v(\hat{\mu}_i) = \hat{\mu}_i$, this has the usual Pearson form of

$$\text{Pearson statistic} = \sum (\text{observed} - \text{fitted})^2/\text{fitted}.$$

When $y_i$ is a binomial proportion based on $n_i$ trials, for which $v(\hat{\mu}_i) = v(\hat{\pi}_i) = \hat{\pi}_i(1 -
\hat{\pi}_i)/n_i$, then the $X^2$ sum (4.39) over the $N$ binomial success observations is identical to a
sum over the $2N$ counts of successes and failures that also has this Pearson form (see also
Section 6.2.1).
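The equivalence of the two binomial forms follows from $1/(n\pi) + 1/[n(1 - \pi)] = 1/[n\pi(1 - \pi)]$, and it holds for any fitted probabilities. The sketch below checks it numerically; the counts and the fitted probabilities are hypothetical values chosen only for illustration.

```python
# Hypothetical grouped data and hypothetical fitted probabilities from some binomial GLM.
trials = [10, 12, 8]
successes = [2, 6, 7]
pi_hat = [0.30, 0.45, 0.70]

# Form (4.39): sum over the N sample proportions with v(pi) = pi * (1 - pi) / n.
x2_proportions = sum(
    (s / n - p) ** 2 / (p * (1 - p) / n)
    for n, s, p in zip(trials, successes, pi_hat)
)

# Pearson form: sum of (observed - fitted)^2 / fitted over the 2N success and failure counts.
x2_cells = sum(
    (s - n * p) ** 2 / (n * p) + ((n - s) - n * (1 - p)) ** 2 / (n * (1 - p))
    for n, s, p in zip(trials, successes, pi_hat)
)

print(x2_proportions, x2_cells)
```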

For two nested models, the Pearson difference $X^2(M_0) - X^2(M_1)$ does not have Pearson
form. It is not even necessarily nonnegative. A more appropriate generalized Pearson
statistic for comparing models is (Lovison 2005, Rao 1961)

$$X^2(M_0 \mid M_1) = \sum_i (\hat{\mu}_{1i} - \hat{\mu}_{0i})^2/v(\hat{\mu}_{0i}). \tag{4.40}$$

This has the generalized Pearson form with $\{\hat{\mu}_{1i}\}$ in place of $\{y_i\}$. This is not the score
statistic for comparing the models unless $M_1$ is the saturated model. However, for Poisson
models with $v(\hat{\mu}_{0i}) = \hat{\mu}_{0i}$ (and corresponding binomial and multinomial GLMs) it is a
quadratic approximation for the difference (4.38) between the deviances and has the same
null asymptotic behavior.

Let $X$ be the model matrix for the full model and let $V(\hat{\mu}_0)$ be the diagonal matrix of
estimated variances of the observations under the simpler model. Then, for the canonical
link case, Lovison (2005) showed that the score statistic for comparing models has a
somewhat different extended Pearson form comparing the two sets of fitted values,

$$(\hat{\mu}_1 - \hat{\mu}_0)^TX[X^TV(\hat{\mu}_0)X]^{-1}X^T(\hat{\mu}_1 - \hat{\mu}_0).$$

Lovison also noted that that statistic bounds below $X^2(M_0 \mid M_1)$. Pregibon (1982) gave the
score statistic in the more general case. He showed also that the score statistic is a difference
between Pearson goodness-of-fit statistics for the models in which the statistic for the full
model is evaluated at fitted values that result from the first step of an iterative fitting process
that starts at the ML estimates for the reduced model.


4.5.6  Residuals  for  GLMs 

When a GLM fits poorly according to an overall goodness-of-fit test, examination of
residuals highlights where the fit is poor. The Pearson residual for observation $i$ is

$$e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{v(\hat{\mu}_i)}}. \tag{4.41}$$

For it, $\sum_i e_i^2 = X^2$, the generalized Pearson $X^2$ statistic. In (4.33) let $D(y; \hat{\mu}) = \sum_i d_i$,
where

$$d_i = 2\omega_i[y_i(\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i)].$$

The deviance residual is

$$\sqrt{d_i} \times \mathrm{sign}(y_i - \hat{\mu}_i), \tag{4.42}$$

for which the sum of squares is the deviance.

For instance, for a Poisson GLM, the Pearson residual is

$$e_i = (y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i}.$$

Consider the model of independence for two-way contingency tables. For cell count $y_{ij} = n_{ij}$
and independence fitted value $\hat{\mu}_{ij}$, the Pearson residual has the form (3.13). Then, $\sum_i\sum_j e_{ij}^2$
is the Pearson $X^2$ chi-squared statistic (3.10), and $\sum_i\sum_j d_{ij} = G^2$, the likelihood-ratio
statistic (3.11) for testing independence.

When the model holds, Pearson and deviance residuals are less variable than standard
normal because they compare $y_i$ to the fitted mean $\hat{\mu}_i$ rather than the true mean $\mu_i$ (e.g.,
the denominator of (4.41) estimates $[v(\mu_i)]^{1/2} = [\mathrm{var}(Y_i - \mu_i)]^{1/2}$ rather than $[\mathrm{var}(Y_i -
\hat{\mu}_i)]^{1/2}$). When $X^2 = \sum_i e_i^2$ has an approximate chi-squared distribution with df $= \nu$, $X^2$ is
asymptotically comparable to the sum of squares of $\nu$ (rather than $N$) independent standard
normal random variables. Thus, when the model holds, $E(\sum_i e_i^2)/N \approx \nu/N < 1$.

We prefer to use standardized residuals, which divide each raw residual $(y_i - \hat{\mu}_i)$ by its
standard error. Let $V = V(\mu)$ denote the diagonal matrix of variances of the observations.
For GLMs we'll see below that the asymptotic covariance matrix of the vector of raw
residuals is

$$\mathrm{cov}(y - \hat{\mu}) = V^{1/2}[I - \mathrm{Hat}]V^{1/2},$$

where $I$ is the identity matrix and Hat is the generalized hat matrix,

$$\mathrm{Hat} = W^{1/2}X(X^TWX)^{-1}X^TW^{1/2}. \tag{4.43}$$

[Recall that $W$ is the diagonal matrix with elements $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i)$.] Let $\hat{h}_i$
denote the estimated diagonal element of Hat for observation $i$, called its leverage. Then,
standardizing by dividing $y_i - \hat{\mu}_i$ by its estimated SE yields the standardized residual

$$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{v(\hat{\mu}_i)(1 - \hat{h}_i)}} = \frac{e_i}{\sqrt{1 - \hat{h}_i}}. \tag{4.44}$$

For Poisson GLMs, for instance, $r_i = (y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i(1 - \hat{h}_i)}$. Pierce and Schafer (1986)
presented standardized deviance residuals.

In linear models the hat matrix is so named because Hat $\times\, y$ projects the data to the
fitted values, $\hat{\mu}$ = "mu-hat." For GLMs with link function $g$, a corresponding relation holds
for a linearized approximation for $g(y)$, as discussed in Section 4.6.4 and Exercise 4.29.

As in ordinary regression, the greater an observation's leverage, the greater its potential
influence on the fit. The leverages fall between 0 and 1 and sum to the number of model
parameters. Unlike ordinary regression, the hat values depend on the fit as well as the model
matrix, and points that have extreme predictor values need not have high leverage.
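The following sketch computes leverages and the Pearson and standardized residuals (4.41) and (4.44) for the log-link heart valve model, using the observed deaths from Table 4.5 and the rounded fitted deaths from Table 4.6. It assumes NumPy, and the variable names are illustrative.

```python
import numpy as np

y = np.array([4.0, 1.0, 7.0, 9.0])           # observed deaths, Table 4.5
mu_hat = np.array([2.28, 2.72, 8.72, 7.28])  # log-link fitted deaths, Table 4.6 (rounded)
X = np.array([[1.0, 0.0, 0.0],               # columns: intercept, age, valve
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Poisson with log link: w_i = mu_i, so W^{1/2} X scales row i of X by sqrt(mu_i).
Xw = np.sqrt(mu_hat)[:, None] * X
hat = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)  # generalized hat matrix (4.43)
leverage = np.diag(hat)

pearson = (y - mu_hat) / np.sqrt(mu_hat)          # Pearson residuals (4.41)
standardized = pearson / np.sqrt(1.0 - leverage)  # standardized residuals (4.44)

print(np.round(leverage, 3), round(leverage.sum(), 6))
print(np.round(pearson, 3))
print(np.round(standardized, 3))
```

The leverages sum to 3, the number of model parameters. With four cells and three parameters there is a single residual degree of freedom, so in this example the standardized residuals all share the same absolute value, the square root of the Pearson statistic.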

4.5.7  Covariance  Matrices  for  Fitted  Values  and  Residuals 

We found in (4.31) that the asymptotic covariance matrix of $\hat{\beta}$ is $(X^TWX)^{-1}$. Let $D$ denote
the diagonal matrix with elements $\partial\mu_i/\partial\eta_i$. Then,

$$W = DV^{-1}D \quad \text{and} \quad V = DW^{-1}D.$$

Since the vector of linear predictor estimated values relates to $\hat{\beta}$ by $\hat{\eta} = X\hat{\beta}$, its
asymptotic covariance matrix is $X(X^TWX)^{-1}X^T$. By the delta method, we can obtain
the asymptotic covariance matrix of fitted values from this, as

$$\mathrm{cov}(\hat{\mu}) = DX(X^TWX)^{-1}X^TD.$$

As in ordinary linear models, we can exploit the decomposition

$$(y - \mu) = (y - \hat{\mu}) + (\hat{\mu} - \mu).$$

If $(y - \hat{\mu})$ is asymptotically uncorrelated with $(\hat{\mu} - \mu)$, then the asymptotic

$$\mathrm{cov}(y - \hat{\mu}) = V - \mathrm{cov}(\hat{\mu}) = DW^{-1}D - DX(X^TWX)^{-1}X^TD.$$

This equals $V^{1/2}[I - \mathrm{Hat}]V^{1/2}$ for the hat matrix given in (4.43).

So, why is $(y - \hat{\mu})$ asymptotically uncorrelated with $(\hat{\mu} - \mu)$, thus generalizing the exact
orthogonal decomposition for linear models? One argument⁴ is as follows: An alternative
asymptotically unbiased estimator of $\mu$ is $\mu^* = [\hat{\mu} + L(y - \hat{\mu})]$, for an $N \times N$ matrix of
constants $L$. But such an estimator cannot be asymptotically more efficient than the ML
estimator $\hat{\mu}$. Let $C = \mathrm{cov}(y - \hat{\mu}, \hat{\mu})$ and consider the case $L = -C[\mathrm{cov}(y - \hat{\mu})]^{-1}$. Then,
direct calculation shows that the asymptotic covariance matrix of $\mu^*$ is

$$\mathrm{cov}(\mu^*) = \mathrm{cov}(\hat{\mu}) - C[\mathrm{cov}(y - \hat{\mu})]^{-1}C^T.$$

But this gives the contradiction that $\mu^*$ is asymptotically more efficient than $\hat{\mu}$, unless
$C = 0$.

4.5.8  The  Bayesian  Approach  for  GLMs 

There  is  by  now  an  enormous  literature  on  the  Bayesian  approach  to  inference  using  GLMs, 
and  many  books  that  survey  the  Bayesian  approach  spend  considerable  time  on  GLMs.  For 
instance,  see  Dey  et  al.  (2000)  and  Christensen  et  al.  (2010). 

In this book, we'll show some details about the Bayesian approach as we present the
various important GLMs for categorical data. In particular, Section 7.2 presents Bayesian
methods for binomial regression models, Section 8.6 presents them for multinomial
models, and Section 10.7 presents them for Poisson loglinear models. A couple of general
results for GLMs are that (1) model parameters in models for categorical data are more
commonly treated with normal prior distributions than conjugate priors, and (2) the
Jeffreys prior is improper for most GLMs except for binary regression models (Ibrahim
and Laud 1991).

⁴ Thanks to Dr. Gianfranco Lovison for showing me this argument.


4.6  FITTING  GENERALIZED  LINEAR  MODELS 

How do we find the ML estimators $\hat{\boldsymbol{\beta}}$ of GLM parameters? The likelihood equations (4.25)
are usually nonlinear in $\hat{\boldsymbol{\beta}}$. We describe a general-purpose iterative method for solving
nonlinear equations and apply it in two ways to determine the maximum of a likelihood
function.


4.6.1  Newton-Raphson  Method 

The  Newton-Raphson  method  is  an  iterative  method  for  solving  nonlinear  equations,  such 
as  equations  whose  solution  determines  the  point  at  which  a  function  takes  its  maximum. 
It  begins  with  an  initial  guess  for  the  solution.  It  obtains  a  second  guess  by  approximating 
the  function  to  be  maximized  in  a  neighborhood  of  the  initial  guess  by  a  second-degree 
polynomial  and  then  finding  the  location  of  that  polynomial’s  maximum  value.  It  then 
approximates  the  function  in  a  neighborhood  of  the  second  guess  by  another  second-degree 
polynomial,  and  the  third  guess  is  the  location  of  its  maximum.  In  this  manner,  the  method 
generates  a  sequence  of  guesses.  These  converge  to  the  location  of  the  maximum  when  the 
function  is  suitable  and/or  the  initial  guess  is  good. 

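In more detail, here's how Newton-Raphson determines the value $\hat{\boldsymbol{\beta}}$ at which a
function $L(\boldsymbol{\beta})$ is maximized. Let $\mathbf{u}^T = (\partial L(\boldsymbol{\beta})/\partial\beta_0, \partial L(\boldsymbol{\beta})/\partial\beta_1, \ldots)$. Let $\mathbf{H}$ denote the matrix
having entries $h_{ab} = \partial^2 L(\boldsymbol{\beta})/\partial\beta_a\,\partial\beta_b$, called the Hessian matrix. Let $\mathbf{u}^{(t)}$ and $\mathbf{H}^{(t)}$ be $\mathbf{u}$ and
$\mathbf{H}$ evaluated at $\boldsymbol{\beta}^{(t)}$, the guess t for $\hat{\boldsymbol{\beta}}$. Step t in the iterative process (t = 0, 1, 2, ...)
approximates $L(\boldsymbol{\beta})$ near $\boldsymbol{\beta}^{(t)}$ by the terms up to second order in its Taylor series
expansion,

$$L(\boldsymbol{\beta}) \approx L(\boldsymbol{\beta}^{(t)}) + \mathbf{u}^{(t)T}(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}) + \tfrac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)})^T\mathbf{H}^{(t)}(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}).$$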

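Solving $\partial L(\boldsymbol{\beta})/\partial\boldsymbol{\beta} \approx \mathbf{u}^{(t)} + \mathbf{H}^{(t)}(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}) = \mathbf{0}$ for $\boldsymbol{\beta}$ yields the next guess. That guess can
be expressed as

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - (\mathbf{H}^{(t)})^{-1}\mathbf{u}^{(t)}, \tag{4.45}$$

assuming that $\mathbf{H}^{(t)}$ is nonsingular. (Computing routines use standard methods for solving
the linear equations rather than explicitly calculating the inverse.)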
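The update (4.45) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `newton_raphson`, its arguments, and the concave test function (with known maximum at (1, -2)) are hypothetical and are not part of the text. The stopping rule monitors the change in L between successive cycles, and the step is found by solving the linear system rather than inverting the Hessian.

```python
import numpy as np

def newton_raphson(loglik, score, hessian, beta0, tol=1e-10, max_iter=50):
    """Maximize loglik by iterating beta <- beta - H^{-1} u, as in (4.45)."""
    beta = np.asarray(beta0, dtype=float)
    value = loglik(beta)
    for _ in range(max_iter):
        u = score(beta)
        H = hessian(beta)
        # Solve H * step = u instead of forming the inverse of H explicitly.
        beta = beta - np.linalg.solve(H, u)
        new_value = loglik(beta)
        if abs(new_value - value) < tol:
            break
        value = new_value
    return beta

# Hypothetical strictly concave function with its maximum at (1, -2).
def loglik_demo(b):
    return -np.cosh(b[0] - 1.0) - (b[1] + 2.0) ** 2

def score_demo(b):
    return np.array([-np.sinh(b[0] - 1.0), -2.0 * (b[1] + 2.0)])

def hessian_demo(b):
    return np.diag([-np.cosh(b[0] - 1.0), -2.0])

beta_hat = newton_raphson(loglik_demo, score_demo, hessian_demo, beta0=[0.0, 0.0])
print(beta_hat)
```

Because the second coordinate of the test function is exactly quadratic, the method finds it in one step; the first coordinate needs a few cycles.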

Iterations proceed until changes in $L(\boldsymbol{\beta}^{(t)})$ between successive cycles are sufficiently
small. The ML estimator is the limit of $\boldsymbol{\beta}^{(t)}$ as $t \to \infty$; however, this need not happen if
$L(\boldsymbol{\beta})$ has other local maxima at which the derivative of $L(\boldsymbol{\beta})$ equals 0. In that case, a good
initial estimate is crucial. To help understand the Newton-Raphson method, work through
these steps when $\boldsymbol{\beta}$ has a single element (Exercise 4.30). Then, Figure 4.6 illustrates a cycle
of the method, showing the parabolic (second-order) approximation at a given step.

The convergence of $\boldsymbol{\beta}^{(t)}$ to $\hat{\boldsymbol{\beta}}$ for the Newton-Raphson method is usually fast. For large
t, the convergence satisfies, for each j,

$$|\beta_j^{(t+1)} - \hat{\beta}_j| \le c\,|\beta_j^{(t)} - \hat{\beta}_j|^2 \quad \text{for some } c > 0$$

and is referred to as second-order. This implies that, once the iterates are sufficiently close
to $\hat{\boldsymbol{\beta}}$, the number of correct decimal places in the approximation roughly doubles with
each iteration. In practice, it often takes relatively few iterations for satisfactory convergence.

For  many  GLMs,  including  Poisson  models  with  log  link  and  binary  models  with  logit 
link,  with  full-rank  model  matrix  the  Hessian  is  negative  definite  and  the  log  likelihood  is 
a  strictly  concave  function.  Then  ML  estimates  of  model  parameters  exist  and  are  unique 
under  quite  general  conditions  (Wedderburn  1976). 

4.6.2  Fisher  Scoring  Method 

Fisher scoring is an alternative iterative method for solving likelihood equations. It
resembles the Newton-Raphson method, the distinction being with the Hessian matrix. Fisher
scoring uses the expected value of this matrix (with sign reversed, the expected information),
whereas Newton-Raphson uses the Hessian matrix itself (with sign reversed, the observed
information).

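Let $\mathcal{J}^{(t)}$ denote the approximation t for the ML estimate of the expected information
matrix; that is, $\mathcal{J}^{(t)}$ has elements $-E(\partial^2 L(\boldsymbol{\beta})/\partial\beta_a\,\partial\beta_b)$, evaluated at $\boldsymbol{\beta}^{(t)}$. The formula for
Fisher scoring is

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathcal{J}^{(t)})^{-1}\mathbf{u}^{(t)}$$

or

$$\mathcal{J}^{(t)}\boldsymbol{\beta}^{(t+1)} = \mathcal{J}^{(t)}\boldsymbol{\beta}^{(t)} + \mathbf{u}^{(t)}. \tag{4.46}$$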
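Formula (4.30) showed that $\mathcal{J} = \mathbf{X}^T\mathbf{W}\mathbf{X}$, where $\mathbf{W}$ is the diagonal matrix with
main-diagonal elements $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i)$. Similarly, $\mathcal{J}^{(t)} = \mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X}$, where $\mathbf{W}^{(t)}$ is $\mathbf{W}$
evaluated at $\boldsymbol{\beta}^{(t)}$. The estimated asymptotic covariance matrix $\hat{\mathcal{J}}^{-1}$ of $\hat{\boldsymbol{\beta}}$ [see (4.31)] occurs
as a by-product of this algorithm as $(\mathcal{J}^{(t)})^{-1}$ for t at which convergence is adequate. From
(4.25), for both Fisher scoring and Newton-Raphson, the score function $\mathbf{u}$ has elements

$$\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_j} = \sum_{i=1}^{N}\frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \tag{4.47}$$

Using the matrix $\mathbf{D} = \mathrm{diag}\{\partial\mu_i/\partial\eta_i\}$ introduced in Section 4.5.7, we see that the GLM
likelihood equations can be expressed as

$$\mathbf{u} = \mathbf{X}^T\mathbf{W}\mathbf{D}^{-1}(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}. \tag{4.48}$$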


For  GLMs  with  a  canonical  link,  we’ll  see  (Section  4.6.5)  that  the  observed  and  expected 
information  are  the  same.  For  noncanonical  link  models,  Fisher  scoring  has  the  advantages 
that  it  produces  the  asymptotic  covariance  matrix  as  a  by-product,  the  expected  information 
is  necessarily  nonnegative  definite,  and  as  seen  next,  it  is  closely  related  to  weighted 
least-squares  methods  for  ordinary  linear  models.  However,  it  need  not  have  second-order 
convergence,  and  for  complex  models  the  observed  information  is  often  easier  to  calculate. 
Efron and Hinkley (1978), developing arguments of R. A. Fisher, gave reasons for preferring
observed information. They argued that its variance estimates better approximate a relevant
conditional variance (conditional on statistics not relevant to the parameter being estimated),
it is "closer to the data," and it tends to agree more closely with Bayesian analyses.
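As a rough illustration of Fisher scoring with a noncanonical link, the sketch below iterates $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathcal{J}^{(t)})^{-1}\mathbf{u}^{(t)}$ using $\mathcal{J} = \mathbf{X}^T\mathbf{W}\mathbf{X}$ and the score in the form (4.48). The function name `fisher_scoring`, its arguments, and the small data set (a Poisson response with identity link) are assumptions made for this example, not material from the text.

```python
import numpy as np

def fisher_scoring(X, y, beta0, inv_link, dmu_deta, variance,
                   tol=1e-10, max_iter=200):
    """Iterate beta <- beta + J^{-1} u, with J = X'WX and u = X'W D^{-1}(y - mu)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        eta = X @ beta
        mu = inv_link(eta)
        d = dmu_deta(eta)               # diagonal of D
        w = d ** 2 / variance(mu)       # diagonal of W
        J = (X.T * w) @ X               # expected information
        u = X.T @ (w * (y - mu) / d)    # score vector, as in (4.48)
        step = np.linalg.solve(J, u)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # The inverse information at the final estimate is the estimated covariance.
    eta = X @ beta
    w = dmu_deta(eta) ** 2 / variance(inv_link(eta))
    cov = np.linalg.inv((X.T * w) @ X)
    return beta, cov

# Made-up counts; Poisson response with the (noncanonical) identity link.
x = np.arange(1.0, 7.0)
X = np.column_stack([np.ones_like(x), x])
y = np.array([4.0, 5.0, 8.0, 9.0, 10.0, 14.0])

beta_hat, cov_hat = fisher_scoring(
    X, y, beta0=[y.mean(), 0.0],
    inv_link=lambda eta: eta,
    dmu_deta=lambda eta: np.ones_like(eta),
    variance=lambda mu: mu,
)
print(beta_hat, np.sqrt(np.diag(cov_hat)))
```

The covariance matrix comes out of the same quantities used for the update, which is the by-product property noted above. With an identity link the fitted means are not guaranteed to stay positive in general; the data here were chosen so that they do.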


4.6.3  Newton-Raphson  and  Fisher  Scoring  for  Binary  Data 

In the next three chapters we use the Newton-Raphson and Fisher scoring methods for
binary regression models. For now, we illustrate them with a simpler problem for which we
know the answer, maximizing the log likelihood based on an observation y from a bin(n, $\pi$)
distribution.

From Section 1.3.2, the first two derivatives of $L(\pi) = y\log\pi + (n - y)\log(1 - \pi)$ are

$$u = (y - n\pi)/[\pi(1 - \pi)], \qquad H = -[y/\pi^2 + (n - y)/(1 - \pi)^2].$$

Each Newton-Raphson step has the form

$$\pi^{(t+1)} = \pi^{(t)} + \left[\frac{y}{(\pi^{(t)})^2} + \frac{n - y}{(1 - \pi^{(t)})^2}\right]^{-1}\frac{y - n\pi^{(t)}}{\pi^{(t)}(1 - \pi^{(t)})}.$$

This adjusts $\pi^{(t)}$ up if $y/n > \pi^{(t)}$ and down if $y/n < \pi^{(t)}$. For instance, with $\pi^{(0)} = 1/2$,
you can check that $\pi^{(1)} = y/n$. When $\pi^{(t)} = y/n$, no adjustment occurs and $\pi^{(t+1)} = y/n$,
which is the correct answer for $\hat{\pi}$. For starting values other than 1/2, adequate convergence
usually takes just a few more iterations.
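The behavior just described is easy to trace numerically. The sketch below is illustrative only: the helper `newton_raphson_binomial` and the data (y = 7 successes in n = 10 trials) are assumptions for this example. It records the Newton-Raphson iterates from two starting values.

```python
def newton_raphson_binomial(y, n, pi0, tol=1e-12, max_iter=50):
    """Return the sequence of Newton-Raphson iterates for the bin(n, pi) log likelihood."""
    path = [pi0]
    pi = pi0
    for _ in range(max_iter):
        u = (y - n * pi) / (pi * (1 - pi))            # first derivative of L
        H = -(y / pi ** 2 + (n - y) / (1 - pi) ** 2)  # second derivative of L
        pi_new = pi - u / H
        path.append(pi_new)
        if abs(pi_new - pi) < tol:
            break
        pi = pi_new
    return path

# Illustrative data (not from the text): y = 7 successes in n = 10 trials.
path_half = newton_raphson_binomial(7, 10, pi0=0.5)
path_other = newton_raphson_binomial(7, 10, pi0=0.3)
print(path_half)
print(path_other)
```

Starting from 1/2, the first update already equals y/n; starting from 0.3, the iterates need a few more cycles to settle at y/n.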
From Section 1.3.2, the information is $n/[\pi(1 - \pi)]$. A step of Fisher scoring gives

$$\pi^{(t+1)} = \pi^{(t)} + \left[\frac{n}{\pi^{(t)}(1 - \pi^{(t)})}\right]^{-1}\frac{y - n\pi^{(t)}}{\pi^{(t)}(1 - \pi^{(t)})} = \pi^{(t)} + \frac{y - n\pi^{(t)}}{n} = \frac{y}{n}.$$

This gives the correct answer for $\hat{\pi}$ after a single iteration and stays at that value for
successive iterations.
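A one-line check of this single-step property, as a minimal sketch with a hypothetical helper name and made-up data (y = 7, n = 10):

```python
def fisher_scoring_step_binomial(y, n, pi):
    """One Fisher scoring update for the bin(n, pi) log likelihood."""
    u = (y - n * pi) / (pi * (1 - pi))   # score
    info = n / (pi * (1 - pi))           # expected information
    return pi + u / info

# Illustrative data (not from the text); several starting values in (0, 1).
updates = [fisher_scoring_step_binomial(7, 10, start) for start in (0.1, 0.3, 0.5, 0.9)]
print(updates)
```

Up to floating-point rounding, every starting value leads to y/n after one update, because the factor $\pi^{(t)}(1 - \pi^{(t)})$ cancels between the score and the information.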


4.6.4  ML  as  Iterative  Reweighted  Least  Squares 

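A relation exists between weighted least-squares estimation and using Fisher scoring to
find ML estimates. We refer here to the general linear model of form

$$\mathbf{z} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$

When the covariance matrix of $\boldsymbol{\epsilon}$ is $\mathbf{V}$, the weighted least-squares (WLS) estimator of $\boldsymbol{\beta}$ is

$$(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{z}.$$

In practice, $\mathbf{V}$ itself must usually be estimated to use this formula.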

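From $\mathcal{J} = \mathbf{X}^T\mathbf{W}\mathbf{X}$ and expression (4.48) for $\mathbf{u}$, it follows that, in (4.46),

$$\mathcal{J}^{(t)}\boldsymbol{\beta}^{(t)} + \mathbf{u}^{(t)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})\boldsymbol{\beta}^{(t)} + \mathbf{X}^T\mathbf{W}^{(t)}(\mathbf{D}^{(t)})^{-1}(\mathbf{y} - \boldsymbol{\mu}^{(t)})$$

$$= \mathbf{X}^T\mathbf{W}^{(t)}[\mathbf{X}\boldsymbol{\beta}^{(t)} + (\mathbf{D}^{(t)})^{-1}(\mathbf{y} - \boldsymbol{\mu}^{(t)})] = \mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)},$$

where $\mathbf{z}^{(t)}$ has elements

$$z_i^{(t)} = \sum_j x_{ij}\beta_j^{(t)} + (y_i - \mu_i^{(t)})\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}}.$$

Equations (4.46) for Fisher scoring then have the form

$$(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})\boldsymbol{\beta}^{(t+1)} = \mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}.$$

These are the normal equations for using weighted least squares to fit a linear model for a
response variable $\mathbf{z}^{(t)}$, when the model matrix is $\mathbf{X}$ and the inverse of the covariance matrix
is $\mathbf{W}^{(t)}$. The equations have solution

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}.$$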
The vector $\mathbf{z}^{(t)}$ in this formulation is an estimated linearized form of the link function g,
evaluated at $\mathbf{y}$,

$$g(y_i) \approx g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)})\,g'(\mu_i^{(t)}) = \eta_i^{(t)} + (y_i - \mu_i^{(t)})\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}} = z_i^{(t)}. \tag{4.49}$$

This adjusted (or "working") response variable $\mathbf{z}$ has element i approximated by $z_i^{(t)}$ for
cycle t of the iterative scheme. That cycle regresses $\mathbf{z}^{(t)}$ on $\mathbf{X}$ with weight (i.e., inverse
covariance) $\mathbf{W}^{(t)}$ to obtain a new estimate $\boldsymbol{\beta}^{(t+1)}$. This estimate yields a new linear predictor
value $\boldsymbol{\eta}^{(t+1)} = \mathbf{X}\boldsymbol{\beta}^{(t+1)}$ and a new adjusted response value $\mathbf{z}^{(t+1)}$ for the next cycle. The ML
estimator results from iterative use of weighted least squares, in which the weight matrix
changes at each cycle. The process is called iterative reweighted least squares. The weight
matrix $\mathbf{W}$ used in $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ [see (4.31)], in the hat matrix (4.43), and in Fisher scoring is the
inverse covariance matrix of the linearized form $\mathbf{z} = \mathbf{X}\boldsymbol{\beta} + \mathbf{D}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ of $g(\mathbf{y})$.

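At convergence,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}^T\hat{\mathbf{W}}\hat{\mathbf{z}},$$

for the estimated linearized response $\hat{\mathbf{z}} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\mathbf{D}}^{-1}(\mathbf{y} - \hat{\boldsymbol{\mu}})$. Since

$$\hat{\boldsymbol{\eta}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}^T\hat{\mathbf{W}}\hat{\mathbf{z}},$$

$\mathbf{X}(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}^T\hat{\mathbf{W}} = \hat{\mathbf{W}}^{-1/2}(\mathbf{Hat})\hat{\mathbf{W}}^{1/2}$ is a sort of asymmetric projection adaptation of
the hat matrix shown in (4.43). Tutz (2011, Sec. 3.10) noted the alternative asymmetric
projection,

$$\hat{\boldsymbol{\mu}} - \boldsymbol{\mu} \approx \mathbf{V}^{1/2}(\mathbf{Hat})\mathbf{V}^{-1/2}(\mathbf{y} - \boldsymbol{\mu}).$$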

A simple way to begin the iterative process uses the data $\mathbf{y}$ as the initial estimate of $\boldsymbol{\mu}$.
This determines the first estimate of the weight matrix $\mathbf{W}$ and hence the initial estimate of
$\boldsymbol{\beta}$. It may be necessary to alter some observations slightly for this first cycle only so that
$g(\mathbf{y})$, the initial value of $\mathbf{z}$, is finite. For instance, when g is the log link applied to counts,
a count of $y_i = 0$ is problematic, so we could set $y_i = 1/2$. This is not a problem with the
model itself, since the log applies to the mean, and fitted means are usually strictly positive
in successive iterations.
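The iterative reweighted least squares cycle can be sketched as follows for a Poisson loglinear model, for which $\partial\mu_i/\partial\eta_i = \mu_i$ and $w_i = \mu_i$. This is an illustration under stated assumptions: the function name `irls_poisson_log` and the small data set are made up, and the starting values follow the suggestion above of using the data as the initial estimate of the means, with zero counts replaced by 1/2 for the first cycle only.

```python
import numpy as np

def irls_poisson_log(X, y, tol=1e-10, max_iter=50):
    """Fit a Poisson GLM with log link by iterative reweighted least squares."""
    y = np.asarray(y, dtype=float)
    mu = np.where(y > 0, y, 0.5)      # initial means: the data, with zeros moved to 1/2
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        w = mu                        # weights (dmu/deta)^2 / var(Y) = mu for the log link
        z = eta + (y - mu) / mu       # adjusted ("working") response
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares fit
        eta = X @ beta_new
        mu = np.exp(eta)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    cov = np.linalg.inv((X.T * mu) @ X)   # estimated asymptotic covariance (X'WX)^{-1}
    return beta, cov

# Made-up counts for illustration, including a zero count.
x = np.arange(8.0)
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 1, 1, 3, 2, 5, 8, 13])

beta_hat, cov_hat = irls_poisson_log(X, y)
print(beta_hat, np.sqrt(np.diag(cov_hat)))
```

At convergence the fit satisfies the likelihood equations, which for this canonical link reduce to $\mathbf{X}^T(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{0}$.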

4.6.5  Simplifications  for  Canonical  Link  Functions 

Certain simplifications result with GLMs using the canonical link function. For that link,

$$\eta_i = \theta_i = \sum_j \beta_j x_{ij}.$$

Often, $a(\phi)$ in the density or mass function (4.17) is identical for all observations, such
as for Poisson GLMs [$a(\phi) = 1$] and binomial GLMs with each $n_i = 1$ [for which
$a(\phi) = 1/n_i = 1$]. Then the part of the log likelihood (4.22) involving both parameters
and data is $\sum_i y_i\theta_i$, which simplifies to

$$\sum_i y_i\Big(\sum_j \beta_j x_{ij}\Big) = \sum_j \beta_j\Big(\sum_i y_i x_{ij}\Big).$$

Sufficient statistics for estimating $\boldsymbol{\beta}$ in the GLM are then

$$\sum_i y_i x_{ij}, \quad j = 0, 1, 2, \ldots.$$


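For the canonical link,

$$\partial\mu_i/\partial\eta_i = \partial\mu_i/\partial\theta_i = \partial b'(\theta_i)/\partial\theta_i = b''(\theta_i).$$

Since $\mathrm{var}(Y_i) = b''(\theta_i)a(\phi)$, the contribution (4.24) to the likelihood equation for $\beta_j$
simplifies to

$$\frac{\partial L_i}{\partial\beta_j} = \frac{y_i - \mu_i}{\mathrm{var}(Y_i)}\,b''(\theta_i)\,x_{ij} = \frac{(y_i - \mu_i)x_{ij}}{a(\phi)}. \tag{4.50}$$

When $a(\phi)$ is identical for all observations, the likelihood equations are

$$\sum_i x_{ij}y_i = \sum_i x_{ij}\mu_i, \quad j = 0, 1, 2, \ldots. \tag{4.51}$$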


This equation illustrates a fundamental result: For GLMs with canonical link, the likelihood
equations equate the sufficient statistics for the model parameters to their expected values.
For a normal distribution with identity link, these are the normal equations. We obtained
these for Poisson loglinear models in (4.32) and for binomial logistic regression models
(when each $n_i = 1$) in (4.28).

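From expression (4.50) for $\partial L_i/\partial\beta_j$, with the canonical link the second derivatives of
the log likelihood have components

$$\frac{\partial^2 L_i}{\partial\beta_h\,\partial\beta_j} = -\frac{x_{ij}}{a(\phi)}\left(\frac{\partial\mu_i}{\partial\beta_h}\right).$$

This does not depend on the observation $y_i$, so

$$\partial^2 L(\boldsymbol{\beta})/\partial\beta_h\,\partial\beta_j = E[\partial^2 L(\boldsymbol{\beta})/\partial\beta_h\,\partial\beta_j].$$

That is, $\mathbf{H} = -\mathcal{J}$, and the Newton-Raphson and Fisher scoring algorithms are identical
for canonical link models (Nelder and Wedderburn 1972).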
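The equality of observed and expected information under a canonical link can be checked numerically. The sketch below assumes logistic regression with each $n_i = 1$, for which $w_i = \pi_i(1 - \pi_i)$; the design matrix, parameter values, and the two response vectors are made up, and the Hessian is approximated by central differences of the score rather than computed analytically.

```python
import numpy as np

def score_logistic(beta, X, y):
    """Score vector X'(y - pi) for logistic regression with n_i = 1."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - pi)

def numerical_hessian(beta, X, y, h=1e-5):
    """Central-difference Jacobian of the score, i.e., the Hessian of L."""
    p = len(beta)
    H = np.zeros((p, p))
    for k in range(p):
        e = np.zeros(p)
        e[k] = h
        diff = score_logistic(beta + e, X, y) - score_logistic(beta - e, X, y)
        H[:, k] = diff / (2 * h)
    return H

# Hypothetical design, parameter value, and two different response vectors.
X = np.column_stack([np.ones(5), [-2.0, -1.0, 0.0, 1.0, 2.0]])
beta = np.array([0.3, -0.7])
y_a = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
y_b = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
expected_info = (X.T * (pi * (1 - pi))) @ X   # J = X'WX with w_i = pi_i(1 - pi_i)

H_a = numerical_hessian(beta, X, y_a)
H_b = numerical_hessian(beta, X, y_b)
print(np.allclose(H_a, -expected_info, atol=1e-6), np.allclose(H_a, H_b, atol=1e-6))
```

The Hessian is the same for both response vectors and matches $-\mathcal{J}$ to within the finite-difference error, consistent with the second derivatives not depending on the data.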
4.7  QUASI-LIKELIHOOD  AND  GENERALIZED  LINEAR  MODELS


As noted in Section 4.4.5, the likelihood equations

$$\sum_{i=1}^{N}\frac{(y_i - \mu_i)x_{ij}}{v(\mu_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right) = 0, \quad j = 0, 1, \ldots, p, \tag{4.52}$$

for a GLM depend on the assumed distribution for $Y_i$ only through $\mu_i$ and $v(\mu_i)$. The choice
of distribution determines the mean-variance relationship $v(\mu_i)$.


4.7.1  Mean-Variance  Relationship  Determines  Quasi-likelihood  Estimates

Wedderburn (1974) proposed an alternative approach, quasi-likelihood estimation, which
assumes only a mean-variance relationship rather than a specific distribution for $Y_i$. It has
a link function and linear predictor of the usual GLM form, but instead of assuming a
distributional type for $Y_i$ it assumes only

$$\mathrm{var}(Y_i) = v(\mu_i)$$


for some chosen variance function v. The equations that determine quasi-likelihood
estimates are the same as the likelihood equations (4.52) for GLMs. They are not likelihood
equations, however, without the additional assumption that $\{Y_i\}$ has distribution in the
natural exponential family.

To illustrate, suppose we assume that the $\{Y_i\}$ are independent with

$$v(\mu_i) = \mu_i.$$

The quasi-likelihood (QL) estimates are the solution of (4.52) with $v(\mu_i)$ replaced by
$\mu_i$. Under the additional assumption that $\{Y_i\}$ have distribution in the natural exponential
family, these estimates are also ML estimates. That case is simply the Poisson distribution.
Thus, for $v(\mu) = \mu$, quasi-likelihood estimates are also ML estimates when the random
component has a Poisson distribution.

Wedderburn  suggested  using  the  estimating  equations  (4.52)  for  any  variance  function, 
even  if  it  does  not  occur  for  a  member  of  the  natural  exponential  family.  In  fact,  the 
purpose  of  the  quasi-likelihood  method  was  to  encompass  a  greater  variety  of  cases,  such 
as  discussed  next.  The  QL  estimates  have  asymptotic  covariance  matrix  of  the  same  form 
(4.31) as in GLMs, namely, $(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}$ with $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i)$.

4.7.2  Overdispersion  for  Poisson  GLMs  and  Quasi-likelihood 

For  count  data,  we’ve  seen  (Section  4.3.3)  that  the  Poisson  assumption  is  often  unrealistic 
because  of  overdispersion  —  the  variance  exceeds  the  mean.  This  suggests  an  alternative 
to  a  Poisson  GLM  in  which  the  mean-variance  relationship  has  the  form 

$$v(\mu_i) = \phi\mu_i$$

for some constant $\phi$. The case $\phi > 1$ represents overdispersion for the Poisson model.
In the estimating equations (4.52) with $v(\mu_i) = \phi\mu_i$, $\phi$ drops out. Thus, the equations
are identical to likelihood equations for Poisson models, and model parameter estimates
are also identical. Also,

$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i) = (\partial\mu_i/\partial\eta_i)^2/\phi\mu_i,$$

so the estimated $\mathrm{cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}$ is $\phi$ times that for the Poisson model.

When a variance function has the form $v(\mu_i) = \phi v^*(\mu_i)$, usually $\phi$ is also unknown.
However, $\phi$ is not in the estimating equations. Let $X^2 = \sum_i (y_i - \hat{\mu}_i)^2/v^*(\hat{\mu}_i)$, a
generalized Pearson statistic for the simpler model with $\phi = 1$. When $X^2/\phi$ is approximately
chi-squared or when $\hat{\mu}_i$ is approximately linear in $\mathbf{y}$ with $v^*(\hat{\mu}_i)$ close to $v^*(\mu_i)$, then
$E(X^2/\phi) \approx N - p$, the number of observations minus the number of model parameters
p. Hence, $E[X^2/(N - p)] \approx \phi$. Using the motivation of moment estimation, Wedderburn
(1974) suggested taking $\hat{\phi} = X^2/(N - p)$ as the estimated multiple of the covariance
matrix.

In summary, this quasi-likelihood approach for count data is simple: Fit the ordinary
Poisson model and use its p parameter estimates. Multiply the ordinary standard error
estimates by $\sqrt{X^2/(N - p)}$.

We illustrate for the horseshoe crab data analyzed with Poisson GLMs in
Section 4.3.2. With the log link, the fit using the crab's carapace width to predict the number
of satellites was $\log\hat{\mu} = -3.305 + 0.164x$, with SE = 0.020 for $\hat{\beta} = 0.164$. To improve
the adequacy of using a chi-squared statistic to summarize fit, we use the satellite totals
and fit for all female crabs at a given width, to increase the counts and fitted values
relative to those for individual female crabs. The N = 66 distinct width levels each have
a total count $y_i$ for the number of satellites and a fitted total $\hat{\mu}_i$. The Pearson statistic
comparing these is $X^2 = 174.3$. The quasi-likelihood adjustment for standard errors equals
$\sqrt{174.3/(66 - 2)} = 1.65$. Thus, SE = 1.65(0.020) = 0.033 is a more plausible standard
error for $\hat{\beta} = 0.164$ in this prediction equation.
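The adjustment amounts to a few lines of arithmetic. The sketch below uses a hypothetical helper, `quasi_likelihood_scale`, together with the values reported above for the horseshoe crab fit ($X^2 = 174.3$, N = 66, p = 2, SE = 0.020).

```python
import math

def quasi_likelihood_scale(pearson_x2, n_obs, n_params):
    """Return (phi_hat, SE multiplier) = (X^2/(N - p), sqrt(X^2/(N - p)))."""
    phi_hat = pearson_x2 / (n_obs - n_params)
    return phi_hat, math.sqrt(phi_hat)

# Values reported in the text for the horseshoe crab fit.
phi_hat, multiplier = quasi_likelihood_scale(174.3, 66, 2)
adjusted_se = multiplier * 0.020
print(round(phi_hat, 2), round(multiplier, 2), round(adjusted_se, 3))
```

The parameter estimates themselves are untouched; only the standard errors (and anything built from them, such as Wald intervals) are inflated by the multiplier.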

Alternative ways of handling overdispersion include mixture models that allow
heterogeneity in the mean at fixed settings of predictors (Chapter 14). For count data these
include Poisson GLMs having random effects and negative binomial GLMs that result
when a Poisson parameter itself has a gamma distribution.


4.7.3  Overdispersion  for  Binomial  GLMs  and  Quasi-likelihood 

The quasi-likelihood approach can also handle overdispersion for counts based on grouped
binary data. Suppose $y_i$ is the sample proportion from $n_i$ Bernoulli trials with
parameter $\pi_i$, $i = 1, \ldots, N$. The $\{y_i\}$ may exhibit more variability than the binomial allows
because of heterogeneity, with observations at a particular setting of explanatory variables
having success probabilities that vary according to values of unobserved variables. Or,
extra variability could occur because the Bernoulli trials at each i are positively correlated
(Section 14.3.3).

With binomial sampling (i.e., independent, identical trials), $E(Y_i) = \pi_i$ and $\mathrm{var}(Y_i) =
\pi_i(1 - \pi_i)/n_i$. A simple quasi-likelihood approach uses the alternative variance function

$$v(\pi_i) = \phi\pi_i(1 - \pi_i)/n_i. \tag{4.53}$$
Overdispersion occurs when $\phi > 1$. The quasi-likelihood estimates are the same as ML
estimates for the binomial model, since $\phi$ drops out of the estimating equations (4.52). As
in the overdispersed Poisson case, $\phi$ enters the denominator of $w_i$. Thus, the asymptotic
covariance matrix multiplies by $\phi$, and standard errors multiply by $\sqrt{\phi}$. An estimate of $\phi$
using the $X^2$ fit statistic for the ordinary binomial model with p parameters is $X^2/(N - p)$
(Finney 1947).

Methods  like  these  that  use  estimates  from  ordinary  models  but  inflate  their  standard 
errors  are  appropriate  only  if  the  model  chosen  describes  well  the  structural  relationship 
between  the  mean  of  Y  and  the  predictors.  If  a  large  goodness-of-fit  statistic  is  due  to  some 
other  type  of  lack  of  fit,  such  as  failing  to  include  a  relevant  interaction  term,  making  an 
adjustment  for  overdispersion  will  not  address  the  inadequacy. 

For  counts  with  binary  data,  alternative  mechanisms  for  handling  overdispersion  include 
mixture  models  such  as  binomial  GLMs  with  random  effects  (Section  13.3)  and  models 
for  which  a  binomial  parameter  itself  has  a  beta  distribution  (Section  14.3).  These  are 
preferable,  because  they  correspond  to  an  actual  model.  By  contrast,  although  the  approach 
using the variance formula $v(\pi_i) = \phi\pi_i(1 - \pi_i)/n_i$ has the advantage of simplicity, it has
a structural problem when $n_i = 1$: Necessarily $v(\pi_i) = \pi_i(1 - \pi_i)$ for ungrouped binary
data, and only $\phi = 1$ then makes sense.


4.7.4  Example:  Teratology  Overdispersion 

Teratology  is  the  study  of  abnormalities  of  physiological  development.  Some  teratology 
experiments investigate effects of dietary regimens or chemical agents on the fetal
development of rats in a laboratory setting. Table 4.7 shows results from one such study (Moore
and  Tsiatis  1991).  Female  rats  on  iron-deficient  diets  were  assigned  to  four  groups.  Rats 
in  group  1  were  given  placebo  injections,  and  rats  in  other  groups  were  given  injections 
of  an  iron  supplement;  this  was  done  weekly  in  group  4,  only  on  days  7  and  10  in  group 
2,  and  only  on  days  0  and  7  in  group  3.  The  58  rats  were  made  pregnant,  sacrificed  after 
three  weeks,  and  then  the  total  number  of  dead  fetuses  was  counted  in  each  litter.  Due  to 
unmeasured  covariates  and  genetic  variability  the  probability  of  death  may  vary  from  litter 
to  litter  within  a  particular  treatment  group. 


Table 4.7  Response Counts of (Litter Size, Number Dead) for 58 Litters of Rats in
Low-Iron Teratology Study

Group 1: Untreated (low iron)
(10, 1) (11, 4) (12, 9) (4, 4) (10, 10) (11, 9) (9, 9) (11, 11) (10, 10) (10, 7) (12, 12) (10, 9) (8, 8)
(11, 9) (6, 4) (9, 7) (14, 14) (12, 7) (11, 9) (13, 8) (14, 5) (10, 10) (12, 10) (13, 8) (10, 10) (14, 3)
(13, 13) (4, 3) (8, 8) (13, 5) (12, 12)

Group 2: Injections days 7 and 10
(10, 1) (3, 1) (13, 1) (12, 0) (14, 4) (9, 2) (13, 2) (16, 1) (11, 0) (4, 0) (1, 0) (12, 0)

Group 3: Injections days 0 and 7
(8, 0) (11, 1) (14, 0) (14, 1) (11, 0)

Group 4: Injections weekly
(3, 0) (13, 0) (9, 2) (17, 2) (15, 0) (2, 0) (14, 1) (8, 0) (6, 0) (17, 0)

Source: Moore and Tsiatis (1991).
Let $y_{i(g)}$ denote the proportion dead of the fetuses in litter i in treatment group g.
Let $\pi_{i(g)}$ denote the probability of death for a fetus in that litter. Consider the model that
treats $n_{i(g)}y_{i(g)}$ as a bin($n_{i(g)}$, $\pi_{i(g)}$) variate, where

$$\pi_{i(g)} = \pi_g, \quad g = 1, 2, 3, 4.$$

That is, the model treats all litters in a particular group g as having the same probability
of death $\pi_g$. The ML fit has estimate $\hat{\pi}_g$ equal to the overall sample proportion of deaths
for all fetuses from litters in that group. These equal $\hat{\pi}_1 = 0.758$ (SE = 0.024), $\hat{\pi}_2 =
0.102$ (SE = 0.028), $\hat{\pi}_3 = 0.034$ (SE = 0.024), and $\hat{\pi}_4 = 0.048$ (SE = 0.021), where for
group g, SE $= \sqrt{\hat{\pi}_g(1 - \hat{\pi}_g)/\sum_i n_{i(g)}}$. The estimated probability of death is considerably
higher for the placebo group.

For litter i in group g, $n_{i(g)}\hat{\pi}_g$ is a fitted number of deaths and $n_{i(g)}(1 - \hat{\pi}_g)$ is a fitted
number of nondeaths. Comparing these fitted values with the observed counts of deaths
and nondeaths in the N = 58 litters using the Pearson statistic gives $X^2 = 154.7$ with
df = 58 - 4 = 54. There is considerable evidence of overdispersion. With the
quasi-likelihood approach, $\{\hat{\pi}_g\}$ are the same as the binomial ML estimates; however, $\hat{\phi} =
X^2/(N - p) = 154.7/(58 - 4) = 2.86$, so standard errors multiply by $\hat{\phi}^{1/2} = 1.69$.

Even with this adjustment for overdispersion, strong evidence remains that the
probability of death is substantially higher for the placebo group. For instance, a 95% confidence
interval for $\pi_1 - \pi_2$ is

$$(0.758 - 0.102) \pm 1.96(1.69)\sqrt{(0.024)^2 + (0.028)^2} \quad \text{or} \quad (0.54, 0.78).$$

This is quite a bit wider than the Wald interval of (0.59, 0.73) for comparing independent
proportions, which ignores the overdispersion.
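The numbers in this example can be reproduced from Table 4.7. The following is a sketch, with the (litter size, number dead) pairs transcribed from the table; the helper name `fit_group` is hypothetical. It computes the group proportions and their binomial standard errors, the Pearson statistic, $\hat{\phi}$, and the adjusted interval for $\pi_1 - \pi_2$.

```python
import math

# (litter size, number dead) pairs from Table 4.7, by treatment group.
litters = {
    1: [(10, 1), (11, 4), (12, 9), (4, 4), (10, 10), (11, 9), (9, 9), (11, 11),
        (10, 10), (10, 7), (12, 12), (10, 9), (8, 8), (11, 9), (6, 4), (9, 7),
        (14, 14), (12, 7), (11, 9), (13, 8), (14, 5), (10, 10), (12, 10), (13, 8),
        (10, 10), (14, 3), (13, 13), (4, 3), (8, 8), (13, 5), (12, 12)],
    2: [(10, 1), (3, 1), (13, 1), (12, 0), (14, 4), (9, 2), (13, 2), (16, 1),
        (11, 0), (4, 0), (1, 0), (12, 0)],
    3: [(8, 0), (11, 1), (14, 0), (14, 1), (11, 0)],
    4: [(3, 0), (13, 0), (9, 2), (17, 2), (15, 0), (2, 0), (14, 1), (8, 0),
        (6, 0), (17, 0)],
}

def fit_group(pairs):
    """Return (proportion dead, binomial SE, Pearson X^2 contribution) for one group."""
    n_total = sum(n for n, _ in pairs)
    dead_total = sum(d for _, d in pairs)
    p = dead_total / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    # The death and nondeath cells of a litter combine to (d - n p)^2 / [n p (1 - p)].
    x2 = sum((d - n * p) ** 2 / (n * p * (1 - p)) for n, d in pairs)
    return p, se, x2

fits = {g: fit_group(pairs) for g, pairs in litters.items()}
n_litters = sum(len(pairs) for pairs in litters.values())
x2_total = sum(x2 for _, _, x2 in fits.values())
phi_hat = x2_total / (n_litters - len(litters))   # X^2 / (N - p)
scale = math.sqrt(phi_hat)                        # multiplier for standard errors

diff = fits[1][0] - fits[2][0]
half_width = 1.96 * scale * math.sqrt(fits[1][1] ** 2 + fits[2][1] ** 2)
ci = (diff - half_width, diff + half_width)
print({g: (round(p, 3), round(se, 3)) for g, (p, se, _) in fits.items()})
print(round(x2_total, 1), round(phi_hat, 2), round(scale, 2), ci)
```

Because the code carries unrounded proportions and standard errors, its interval endpoints can differ slightly in the last digit from a hand calculation that starts from the rounded values shown above.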


NOTES 

Section  4.1:  The  Generalized  Linear  Model 

4.1  Exponential families: Distribution (4.1) is called a natural (or linear) exponential family to
distinguish it from a more general exponential family that replaces y by r(y) in the exponential
term. For other generalizations, see Jørgensen (1983, 1987). Books on GLMs and related
models include Aitkin et al. (2009), Fahrmeir and Tutz (2001), Lee et al. (2006), McCullagh
and Nelder (1989), and McCulloch et al. (2008).


Section  4.3:  Generalized  Linear  Models  for  Counts  and  Rates 

4.2  Poisson GLMs: For further discussion of Poisson regression and related models for count data,
see Breslow (1984, 1990), Cameron and Trivedi (1998), Frome (1983), Hinde (1982), Lawless
(1987), and Seeber (2005) and references therein.

4.3  Rates/survival: Consider a contingency table for rate data (such as Table 4.10 below) in which
one  dimension  is  a  discrete  time  scale.  Holford  (1980)  and  Laird  and  Olivier  (1981)  showed 
that  Poisson  loglinear  models  and  likelihoods  for  this  table  are  equivalent  to  loglinear  hazard 
models  and  likelihoods  that  assume  piecewise  exponential  hazards  for  the  survival  times.  For 
short time intervals, this approach is essentially nonparametric and is a discrete version of
the  Cox  proportional  hazards  model.  For  other  analyses  of  rate  data,  see  Breslow  and  Day 
(1987,  Sec.  4.5),  Frome  (1983),  and  Hoem  (1987).  Doksum  and  Gasko  (1990)  summarized 
the  connection  between  the  logistic  regression  model  and  the  proportional  odds  model  for  the 
analysis  of  survival  data.  Other  articles  dealing  with  loglinear  and  logistic  models  for  grouped 
survival  data  include  Aitkin  and  Clayton  (1980),  Aranda-Ordaz  (1983),  Larson  (1984),  Prentice 
and  Gloeckler  (1978),  Schluchter  and  Jackson  (1989),  Stokes  et  al.  (2012),  and  Thompson 
(1977). 


Section  4.5:  Inference  and  Model  Checking  for  Generalized  Linear  Models 

4.4  Pearson statistics: For use of the Pearson statistic and related statistics for model comparison,
see Agresti and Ryu (2010), Haberman (1977a), Lovison (2005), Pregibon (1982), Rao (1961),
and Smyth (2003).

4.5  Diagnostics: McCullagh and Nelder (1989, Chap. 12) discussed model checking for GLMs.
Separate diagnostics are useful for checking the adequacy of each component. For a family
$g(\mu; \gamma)$ of link functions indexed by parameter $\gamma$, Pregibon (1980) showed how to estimate
$\gamma$ giving the link with best fit and how to check the adequacy of a given link $g(\mu; \gamma_0)$.
For discussions about residuals, see also Davison (1991), Green (1984), Pierce and Schafer
(1986), Pregibon (1980, 1981), and Williams (1987). Pregibon (1982) showed that the squared
standardized residual is the score statistic for testing whether the observation is an outlier.
Davison and Hinkley (1997, Sec. 7.2) discussed bootstrapping in GLMs.


Section  4.6:  Fitting  Generalized  Linear  Models 

4.6  IRLS/ML: Fisher (1935b) introduced the Fisher scoring method to calculate ML estimates
for probit models. For further discussion of GLM model fitting and the relationship between
iterative reweighted least squares and ML estimation, see Green (1984), Jørgensen (1983),
McCullagh and Nelder (1989), and Nelder and Wedderburn (1972). Green (1984), Jørgensen
(1983), and Palmgren and Ekholm (1987) also discussed this relation for exponential family
nonlinear models.


Section  4.7:  Quasi-likelihood  and  Generalized  Linear  Models 

4.7  Quasi-likelihood:  For  more  on  quasi-likelihood,  see  Sections  12.3,  13.6.4,  and  14.3,  Breslow 
(1984,  1990),  Cox  (1983),  Firth  (1987),  Hinde  and  Demetrio  (1998),  Heyde  (1997),  Lee  et 
al.  (2006,  Chap.  3),  McCullagh  (1983),  McCullagh  and  Nelder  (1989),  Nelder  and  Pregibon 
(1987),  and  Wedderburn  (1974,  1976). 


EXERCISES 

Applications 

4.1  In the 2000 U.S. presidential election, Palm Beach County in Florida was the
focus of unusual voting patterns apparently caused by a confusing "butterfly ballot."
Many voters claimed that they voted mistakenly for the Reform Party candidate,
Pat Buchanan, when they intended to vote for Al Gore. Figure 4.7 shows the total
number of votes for Buchanan plotted against the number of votes for the Reform
Party candidate in 1996 (Ross Perot), by county in Florida.
Figure 4.7  Total vote, by county in Florida, for Reform Party candidates Buchanan in 2000 and Perot in 1996.


a.  In county i, let $\pi_i$ denote the proportion of the vote for Buchanan and let $x_i$ denote
the proportion of the vote for Perot in 1996. For the linear probability model fitted
to all counties except Palm Beach, $\hat{\pi}_i = -0.0003 + 0.0304x_i$. Give the value of
P in the interpretation: The estimated proportion vote for Buchanan in 2000 was
roughly P% of that for Perot in 1996.

b.  For Palm Beach County, $\pi_i = 0.0079$ and $x_i = 0.0774$. Does this result appear
to be an outlier? Investigate, by finding $\pi_i/\hat{\pi}_i$. (George W. Bush won the state
by 537 votes and, with it, the Electoral College and the election. Other ballot
design problems played a role in 110,000 disqualified "overvote" ballots, in
which people mistakenly voted for more than one candidate, with Gore marked
on 84,197 ballots and Bush on 37,731. For details, see A. Agresti and B. Presnell,
Statist. Sci., 17: 436-440, 2003.)

4.2 For Table 3.8 with scores (0, 0.5, 1.5, 4.0, 7.0) for alcohol consumption, ML fitting
of the linear probability model for malformation has output:

Parameter   Estimate   Std Error   Wald 95% Conf Limits
Intercept   0.0025     0.0003       0.0019    0.0032
alcohol     0.001087   0.000727    -0.0003    0.0025

a. Interpret the model parameter estimates. Use the fit to estimate the relative risk
of malformation that compares alcohol consumption levels 0 and 7.0.

b. Some software (such as R) also reports $\hat{\beta} = 0.001087$ but instead reports SE =
0.000832. Why do you think the SE value is different? [Hint: The software with
output shown above inverts the observed information matrix to obtain the standard
errors.]

4.3 For Table 4.2, refit the linear probability model or the logistic regression model using
the scores (a) (0, 2, 4, 6), (b) (0, 1, 2, 3), and (c) (1, 2, 3, 4). Compare $\hat{\beta}$ for the three
choices. Compare fitted values. Summarize the effect of linear transformations of
scores, which preserve relative sizes of spacings between scores.

4.4 For the data shown in part in Table 4.3, let Y = 1 if a crab has at least one satellite,
and Y = 0 otherwise. Using x = weight, fit the linear probability model.

a.  Use  ordinary  least  squares.  Interpret  the  parameter  estimates.  Find  the  estimated 
probability  at  the  highest  observed  weight,  5.20  kg.  Comment. 

b.  Try  to  fit  the  model  using  ML,  treating  Y  as  binomial.  [The  failure  is  due  to  a  fitted 
probability  falling  outside  the  (0,  1)  range.  The  fit  in  part  (a)  is  ML  for  a  normal 
random  component,  for  which  fitted  values  outside  this  range  are  permissible.] 

c.  Fit  the  logistic  regression  model.  Show  that  the  fitted  probability  at  a  weight  of 
5.20  kg  equals  0.9968. 

4.5  We  use  the  following  artificial  data  to  illustrate  comments  in  Section  4.5.3  about 
grouped  versus  ungrouped  binary  data: 

x   Number of trials   Number of successes
0          4                    1
1          4                    2
2          4                    4

Denote by $M_0$ the model logit[$P(Y = 1)$] = $\alpha$ and by $M_1$ the model logit[$P(Y = 1)$] =
$\alpha + \beta x$. Denote the maximized log-likelihood values by $L_0$ for $M_0$, $L_1$ for
$M_1$, and $L_s$ for the saturated model. Create a data file in two ways, entering the
data as (i) ungrouped data: $n_i = 1$, $i = 1, \ldots, 12$; and (ii) grouped data: $n_i = 4$,
$i = 1, 2, 3$.

a. Fit $M_0$ and $M_1$ for each data file. Report $L_0$ and $L_1$ in each case. Note they are
the same for each form of data entry.

b. Show that the deviances for $M_0$ and $M_1$ differ for the two forms of data entry.
Why is this? [Hint: How many parameters are in the saturated model for each
data file?]

c. Show that the difference between the deviances for $M_0$ and $M_1$ is the same for
each form of data entry. Why is this? (Thus, for testing the effect of x, it does not
matter how you enter the data, but it does matter if you want to test goodness of
fit.)

4.6 For Table 4.3, Table 4.8 shows SAS output for a Poisson loglinear model fit using
x = weight and Y = number of satellites.

a. Use $\hat{\beta}$ to describe the weight effect. Show how to construct the reported confidence
interval.

b. Construct a Wald test that Y is independent of x. Interpret. What else do you need
to conduct a likelihood-ratio test of this hypothesis?

c. Is there evidence of overdispersion? Explain.

Table 4.8  SAS Output for Exercise 4.6

Criterion             DF      Value
Deviance             171   560.8664
Pearson Chi-Square   171   535.8957
Log Likelihood              71.9524

                                  Wald 95%
Parameter   Estimate  Std Error   Conf Limits        Chi-Sq  Pr > ChiSq
Intercept    -0.4284     0.1789   -0.7791  -0.0777     5.73      0.0167
weight        0.5893     0.0650    0.4619   0.7167    82.15      <.0001

4.7 An experiment analyzes imperfection rates for two processes used to fabricate silicon
wafers for computer chips. For treatment A applied to 10 wafers, the numbers of
imperfections are 8, 7, 6, 6, 3, 4, 7, 2, 3, 4. Treatment B applied to 10 other wafers
has 9, 9, 8, 14, 8, 13, 11, 5, 7, 6 imperfections. Treat the counts as independent
Poisson variates having means $\mu_A$ and $\mu_B$.

a. Fit the model $\log \mu = \alpha + \beta x$, where x = 1 for treatment B and x = 0 for
treatment A. Show that $\exp(\beta) = \mu_B/\mu_A$, and interpret its estimate.

b. Test $H_0$: $\mu_A = \mu_B$ using the Wald or likelihood-ratio test of $H_0$: $\beta = 0$. Interpret.

c. Construct a 95% confidence interval for $\mu_B/\mu_A$. [Hint: First construct one
for $\beta$.]

d. Test $H_0$: $\mu_A = \mu_B$ based on this result: If $Y_1$ and $Y_2$ are independent Poisson
with means $\mu_1$ and $\mu_2$, then given $n = (Y_1 + Y_2)$, $Y_1$ is bin($n$, $\pi$) with $\pi =
\mu_1/(\mu_1 + \mu_2)$.
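
As a rough numerical companion to Exercise 4.7, the sketch below computes the quantities that the exercise refers to without calling any model-fitting software. It assumes that, with a single indicator predictor, the fitted group means equal the sample means, and that the large-sample variance of the log ratio of two Poisson means is the sum of the reciprocals of the two total counts. The variable names are illustrative only.

```python
import math

# Imperfection counts quoted in Exercise 4.7.
treatment_a = [8, 7, 6, 6, 3, 4, 7, 2, 3, 4]
treatment_b = [9, 9, 8, 14, 8, 13, 11, 5, 7, 6]

mean_a = sum(treatment_a) / len(treatment_a)
mean_b = sum(treatment_b) / len(treatment_b)

# With x = 1 for B and x = 0 for A, exp(beta) is the ratio of the two means.
beta_hat = math.log(mean_b / mean_a)

# Assumed large-sample SE of the log ratio: sqrt(1/total_A + 1/total_B).
se_beta = math.sqrt(1 / sum(treatment_a) + 1 / sum(treatment_b))

# Wald statistic for H0: beta = 0, and a 95% Wald interval for the ratio.
z_wald = beta_hat / se_beta
ci_ratio = (math.exp(beta_hat - 1.96 * se_beta),
            math.exp(beta_hat + 1.96 * se_beta))

print("estimated ratio of means:", math.exp(beta_hat))
print("SE of beta_hat:", se_beta)
print("Wald z:", z_wald)
print("95% CI for ratio:", ci_ratio)
```

A likelihood-ratio test or a profile likelihood interval would need the maximized log-likelihoods, which this sketch does not compute.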

4.8  Refer  to  Exercise  4.7.  The  sample  mean  and  variance  are  5.0  and  4.2  for  treatment 
A  and  9.0  and  8.4  for  treatment  B. 

a. Is there evidence of overdispersion for the Poisson model? Explain. Fit the negative
binomial loglinear model. Note that the estimated dispersion parameter is
0 and that estimates of treatment means and standard errors are the same as with
the Poisson loglinear GLM.

b. For the overall sample of 20 observations, the sample mean and variance are
7.0 and 10.2. Fit the loglinear model having only an intercept term under Poisson
and negative binomial assumptions. Compare results, and compare confidence
intervals for the overall mean response. Why do they differ? [Note: This
shows how the Poisson model can lose validity when an important covariate is
unobserved.]

4.9 For the negative binomial model fitted to the crab satellite counts with log link and
width predictor, $\hat{\alpha} = -4.05$, $\hat{\beta} = 0.192$ (SE = 0.048), $\hat{\gamma} = 1.106$ (SE = 0.197).
Interpret. Why is SE for $\hat{\beta}$ so different from SE = 0.020 for the corresponding
Poisson GLM in Section 4.3.2? Which is more appropriate? Why?

4.10  Fit  the  rate  model  (4.15)  for  the  heart  valve  operations. 

a. Find the 95% profile likelihood confidence interval for $\beta_1$, and show that it
translates to (1.32, 10.39) for the true multiplicative effect $\exp(\beta_1)$.

b. Show that the deviance comparing $\{y_{ij}\}$ with $\{\hat{\mu}_{ij}\}$ is $G^2 = 3.22$, with residual
df = 1. Show that $G^2 = 1.09$ for the corresponding model with identity link.

c.  What  is  the  effect  on  the  model  parameter  estimates,  SE  values,  and  the  deviance 
when  (i)  the  times  at  risk  are  doubled,  but  the  numbers  of  deaths  stay  the  same; 
(ii)  the  times  at  risk  stay  the  same,  but  the  numbers  of  deaths  double;  and  (iii)  the 
times  at  risk  and  the  numbers  of  deaths  both  double? 


4.11  Table  4.9  is  based  on  a  study  with  British  doctors. 

a.  For  each  age,  find  the  sample  coronary  death  rates  per  1000  person-years  for 
nonsmokers  and  smokers.  To  compare  them,  take  their  ratio  and  describe  its 
dependence  on  age. 

b. Fit a main-effects model for the log rates using age and smoking as factors. In
discussing lack of fit, show that this model assumes a constant ratio of nonsmokers’
to smokers’ coronary death rates over age.

c.  From  part  (a),  explain  why  it  is  sensible  to  add  a  quantitative  interaction  of  age 
and  smoking.  For  this  model,  show  that  the  log  ratio  of  coronary  death  rates 
changes  linearly  with  age.  Assign  scores  to  age,  fit  the  model,  and  interpret. 


Table 4.9  Data for Exercise 4.11 on Coronary Death Rates

              Person-Years            Coronary Deaths
Age      Nonsmokers   Smokers      Nonsmokers   Smokers
35-44       18,793     52,407           2           32
45-54       10,673     43,248          12          104
55-64         5710     28,612          28          206
65-74         2585     12,663          28          186
75-84         1462       5317          31          102

Source: R. Doll and A. Bradford Hill, Natl. Cancer Inst. Monogr. 19: 205-268, 1966. See also N. R. Breslow in
A Celebration of Statistics, ed. A. C. Atkinson and S. E. Fienberg. New York: Springer-Verlag, 1985.


4.12 Table 4.10 describes survival for 539 males diagnosed with lung cancer. The prognostic
factors are histology (H) and stage (S) of disease. The assumption of a constant
rate over time is often not sensible, and this study divided the time scale (T) into
two-month intervals and let the rate vary by the time interval. Let $\mu_{ijk}$ denote the
expected number of deaths and $t_{ijk}$ the total time at risk for histology $i$ and stage
of disease $j$, in follow-up time interval $k$. Analyses suggested a lack of interaction
between T and either prognostic factor (i.e., such proportional hazards models have
the same effects of H and S for each time interval).

a. The main effects model

$$\log(\mu_{ijk}/t_{ijk}) = \alpha + \beta_i^H + \beta_j^S + \beta_k^T$$

has deviance 43.9. Explain why df = 52. Does the model seem to fit adequately?

b. For this model, interpret the estimated effects of S, $\hat{\beta}_2^S - \hat{\beta}_1^S = 0.470$ (SE =
0.174), $\hat{\beta}_3^S - \hat{\beta}_1^S = 1.324$ (SE = 0.152).

c. The model that adds an S x H interaction term has deviance 41.5 with df = 48.
Test whether a significantly improved fit results by allowing this interaction.

Table 4.10  Data on Number of Deaths from Lung Cancer for Exercise 4.12

Follow-up                               Histology(a)
Time Interval            I                   II                  III
(months)         Disease Stage:       Disease Stage:       Disease Stage:
                 1      2      3      1      2      3      1      2      3

0-2              9     12     42      5      4     28      1      1     19
              (157    134    212     77     71    130     21     22    101)
2-4              2      7     26      2      3     19      1      1     11
              (139    110    136     68     63     72     17     18     63)
4-6              9      5     12      3      5     10      1      3      7
              (126     96     90     63     58     42     14     14     43)
6-8             10     10     10      2      4      5      1      1      6
              (102     86     64     55     42     21     12     10     32)
8-10             1      4      5      2      2      0      0      0      3
               (88     66     47     50     35     14     10      8     21)
10-12            3      3      4      2      1      3      1      0      3
               (82     59     39     45     32     13      8      8     14)
12+              1      4      1      2      4      2      0      2      3
               (76     51     29     42     28      7      6      6     10)

(a) Values in parentheses represent total follow-up time at risk.

Source:  Reprinted  from  Holford  (1980)  with  permission  from  the  Biometric  Society. 


4.13 Table 4.11 shows the three-point shooting, by game, of Ray Allen of the Boston
Celtics during the 2010 NBA (basketball) playoffs. Commentators remarked that
his shooting varied dramatically from game to game. In game $i$, suppose that $Y_i$ =
number of three-point shots made out of $n_i$ attempts is a bin($n_i$, $\pi_i$) variate and the
$\{Y_i\}$ are independent.

a. Fit the model, $\pi_i = \alpha$, and find and interpret $\hat{\alpha}$ and its standard error. Does the
model appear to fit adequately? [Note: You could check this with a small-sample
test of independence of the 24 x 2 table of game and the binomial outcome.]


Table 4.11  Data for Exercise 4.13 on Basketball Shooting

        Number  Number of          Number  Number of          Number  Number of
Game     Made   Attempts    Game    Made   Attempts    Game    Made   Attempts
  1        0        4         9       1        8        17       3        7
  2        7        9        10       6        9        18       0        2
  3        4       11        11       0        5        19       8       11
  4        3        6        12       2        5        20       0        8
  5        5        6        13       0        5        21       0        4
  6        2        7        14       2        4        22       0        4
  7        3        7        15       5        7        23       2        5
  8        0        1        16       1        3        24       2        7

Source: boston.stats.com/nba.

b. Adjust the standard error for overdispersion. Using the original SE and its correction,
find and compare 95% confidence intervals for $\alpha$. Interpret.

c.  Describe  a  factor  that  could  cause  overdispersion.  [Hint:  Is  it  realistic  to  treat  the 
success  probability  as  identical  from  shot  to  shot?] 

4.14 Refer to Table 14.6. Fit a loglinear model with an indicator variable for race, (a)
assuming a Poisson distribution, and (b) allowing overdispersion with a quasi-likelihood
approach. Compare results.


Theory  and  Methods 

4.15  Describe  the  purpose  of  the  link  function  of  a  GLM.  Explain  why  the  identity  link 
is  not  often  used  with  binomial  or  Poisson  responses. 

4.16  For  binary  data,  define  a  GLM  using  the  log  link.  Show  that  effects  refer  to  the 
relative  risk.  Why  do  you  think  this  link  is  not  often  used?  [Hint:  What  happens  if 
the  linear  predictor  takes  a  positive  value?] 

4.17 For the logistic regression model (4.6) with $\beta > 0$, show that (a) as $x \to \infty$, $\pi(x)$ is
monotone increasing, and (b) the curve for $\pi(x)$ is the cdf of a logistic distribution
having mean $-\alpha/\beta$ and standard deviation $\pi/(|\beta|\sqrt{3})$.

4.18 Let $Y_i$ be a bin($n_i$, $\pi_i$) variate for group $i$, $i = 1, \ldots, N$, with $\{Y_i\}$ independent. For
the model that $\pi_1 = \cdots = \pi_N$, denote that common value by $\pi$. For observations
$\{y_i\}$, show that $\hat{\pi} = (\sum_i y_i)/(\sum_i n_i)$. When all $n_i = 1$, for testing this model’s
fit in the N x 2 table, show that $X^2 = N$. Thus, goodness-of-fit statistics can be
completely uninformative for ungrouped data. (See also Exercise 5.35.)
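
A quick numerical check (not a proof) of the claim in Exercise 4.18 can be sketched as follows. The function name and the two binary samples are made up for illustration; the Pearson statistic is computed directly from the N x 2 table of observed and fitted counts.

```python
def pearson_x2_ungrouped(y):
    """Pearson X^2 for the N x 2 table when every n_i = 1 and all pi_i are equal."""
    n_obs = len(y)
    pi_hat = sum(y) / n_obs  # ML estimate under the common-probability model
    x2 = 0.0
    for yi in y:
        # Row i has observed counts (yi, 1 - yi) and fitted counts (pi_hat, 1 - pi_hat).
        x2 += (yi - pi_hat) ** 2 / pi_hat
        x2 += ((1 - yi) - (1 - pi_hat)) ** 2 / (1 - pi_hat)
    return x2

# Two hypothetical binary samples with very different success proportions.
sample_1 = [1, 0, 0, 1, 1, 0, 1, 1]
sample_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for sample in (sample_1, sample_2):
    print(len(sample), pearson_x2_ungrouped(sample))
```

In both cases the statistic equals the sample size regardless of the data, which is the sense in which it carries no information about fit.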

4.19 Suppose that $Y_i$ is Poisson with $g(\mu_i) = \alpha + \beta x_i$, where $x_i = 1$ for $i = 1, \ldots, n_A$
from group A and $x_i = 0$ for $i = n_A + 1, \ldots, n_A + n_B$ from group B. Show that for
any link function $g$, the GLM likelihood equations (4.25) imply that fitted means $\hat{\mu}_A$
and $\hat{\mu}_B$ equal the sample means.

4.20 A method for negative exponential modeling of survival times relates to the Poisson
loglinear model for rates (Aitkin and Clayton 1980). Let T denote the time to some
event, with pdf $f$ and cdf $F$. For subject $i$, let $w_i = 1$ for death and 0 for censoring,
and let $T = \sum_i t_i$ and $W = \sum_i w_i$.

a. Explain why the survival-time log likelihood for n independent observations is

$$L(\lambda) = \sum_i w_i \log[f(t_i)] + \sum_i (1 - w_i) \log[1 - F(t_i)].$$

(This actually applies only for noninformative censoring mechanisms.) Assuming
$f(t) = \lambda \exp(-\lambda t)$, show that $\hat{\lambda} = W/T$. Conditional on $T$, explain why $W$ has
a Poisson distribution with mean $T\lambda$, and using the Poisson likelihood show that
$\hat{\lambda} = W/T$.

b. The hazard function represents the instantaneous rate of death for subjects who
have survived to time $t$. Suppose that $h(t; x) = \lambda \exp(\beta^T x)$. With parameter $\lambda$ in
$f(t)$ replaced by $\lambda \exp(\beta^T x)$ and with $\mu_i = t_i \lambda \exp(\beta^T x_i)$, show that $L$ simplifies
to

$$L(\lambda, \beta) = \sum_i w_i \log \mu_i - \sum_i \mu_i - \sum_i w_i \log t_i.$$

c. Explain why maximizing $L(\lambda, \beta)$ is equivalent to maximizing the likelihood for
the Poisson loglinear model

$$\log \mu_i - \log t_i = \log \lambda + \beta^T x_i$$

with offset $\log(t_i)$, using “observations” $\{w_i\}$.

d. When we sum terms in $L$ for subjects having a common value of $x$, explain why
the observed data are the numbers of deaths ($\sum w_i$) at each setting of $x$, and the
offset is the log of ($\sum t_i$) at each setting.

4.21 A binomial GLM $\pi_i = \Phi(\sum_j \beta_j x_{ij})$ with arbitrary inverse link function $\Phi$ assumes
that $n_i Y_i$ has a bin($n_i$, $\pi_i$) distribution. Find $w_i$ in (4.29) and hence cov($\hat{\beta}$). For
logistic regression, show that $w_i = n_i \pi_i (1 - \pi_i)$.

4.22 For the class of binary models (4.8) and (4.9), suppose the standard cdf $\Phi$ corresponds
to a pdf $\phi$ that is symmetric around 0.

a. Show that $x$ at which $\pi(x) = 0.50$ is $x = -\alpha/\beta$.

b. Show that the rate of change in $\pi(x)$ when $\pi(x) = 0.50$ is $\beta\phi(0)$. Show this is
$0.25\beta$ for the logit link and $\beta/\sqrt{2\pi}$ (where $\pi = 3.14\ldots$) for the probit link.

c. Show that the probit regression curve for $\beta > 0$ has the shape of a normal cdf
with mean $-\alpha/\beta$ and standard deviation $1/|\beta|$.

4.23 For binary observations, consider the model $\pi(x) = \frac{1}{2} + (1/\pi)\tan^{-1}(\alpha + \beta x)$.

a. Show that this corresponds to a cdf of a distribution for which the standard
version is the Cauchy. When would you expect a GLM using this curve to be
more appropriate than logistic regression?

b. Explain how this model generalizes to a family of models for which the link
function is the inverse of the cdf of a $t$ distribution for some df value, the probit
resulting as df $\to \infty$.

4.24 A GLM has parameter $\beta$ with sufficient statistic $S$. A goodness-of-fit test statistic
$T$ has observed value $t_o$. If $\beta$ were known, a P-value is $P = P(T \ge t_o; \beta)$. Explain
why $P(T \ge t_o \mid S)$ is the uniform minimum variance unbiased estimator of $P$.

4.25  Find  the  form  of  the  deviance  residual  (4.42)  for  an  observation  in  a  (a)  binomial 
GLM,  and  (b)  Poisson  GLM.  Illustrate  part  (b)  for  a  cell  count  in  a  two-way 
contingency  table  for  the  model  of  independence. 

4.26 Let $y_{ij}$ be observation $j$ of a count variable for group $i$, $i = 1, \ldots, I$, $j = 1, \ldots, n_i$.
Suppose that $\{Y_{ij}\}$ are independent Poisson with $E(Y_{ij}) = \mu_i$.

a. Show that the ML estimate of $\mu_i$ is $\hat{\mu}_i = \bar{y}_i = \sum_j y_{ij}/n_i$.

b. Simplify the expression for the deviance for this model. [For testing this model, it
follows from Fisher (1970, p. 58, originally published in 1925) that the deviance
and the Pearson statistic $\sum_i \sum_j (y_{ij} - \bar{y}_i)^2/\bar{y}_i$ have approximate chi-squared
distributions with df = $\sum_i (n_i - 1)$. For a single group, Cochran (1954) referred to
$\sum_j (y_{1j} - \bar{y}_1)^2/\bar{y}_1$ as the variance test for the fit of a Poisson distribution, since
it compares the sample variance to the estimated Poisson variance $\bar{y}_1$.]

4.27 For known $k$, show that the negative binomial distribution (4.13) has exponential
family form (4.1) with natural parameter $\log[\mu/(\mu + k)]$.

4.28 Consider the normal distribution $N(\mu, \sigma^2)$.

a. With fixed $\sigma$, show it satisfies exponential family (4.1), and identify the components.
Formulate the ordinary regression model as a GLM. Explain why the
least-squares estimates are then ML and the deviance is $\sum_i (y_i - \hat{\mu}_i)^2$.

b. When $\sigma$ is also a parameter, show that it satisfies the exponential dispersion
family (4.17).

4.29 For a GLM, refer to the adjusted response variable in Section 4.6.4. Let $z_0 = W^{1/2} z$,
$X_0 = W^{1/2} X$, and $\hat{\eta}_0 = W^{1/2} X \hat{\beta}$. Show that (a) $\hat{\beta}$ is the ordinary least-squares
solution for the model $z_0 = X_0 \beta + \epsilon$, (b) the estimated hat matrix for the GLM
equals $X_0 (X_0^T X_0)^{-1} X_0^T$, (c) $(z_0 - \hat{\eta}_0)$ are the Pearson residuals, and (d) $z_0$ and $\hat{\eta}_0$
are analogs for GLMs of $y$ and $\hat{\mu}$ in ordinary linear models, in terms of projections
and orthogonal decompositions. (Thanks to Gianfranco Lovison for suggesting
these results. See also Fahrmeir and Tutz 2001, pp. 147-148, and Tutz 2011,
Sec. 3.10.)

4.30 Let $\beta^{(0)}$ denote an initial guess for the value $\hat{\beta}$ that maximizes a function $L(\beta)$.

a. Using $L'(\hat{\beta}) = L'(\beta^{(0)}) + (\hat{\beta} - \beta^{(0)}) L''(\beta^{(0)}) + \cdots$, argue that for $\beta^{(0)}$ close to $\hat{\beta}$,
approximately $0 = L'(\beta^{(0)}) + (\hat{\beta} - \beta^{(0)}) L''(\beta^{(0)})$. Solve this equation to obtain
an approximation $\beta^{(1)}$ for $\hat{\beta}$.

b. Let $\beta^{(t)}$ denote approximation $t$ for $\hat{\beta}$, $t = 0, 1, 2, \ldots$. Justify that the next
approximation is

$$\beta^{(t+1)} = \beta^{(t)} - L'(\beta^{(t)})/L''(\beta^{(t)}).$$
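
The updating formula in Exercise 4.30(b) can be sketched in a few lines of code. The helper name `newton_raphson` and its arguments are illustrative assumptions; the demonstration uses the log-likelihood of a single binomial sample with y = 3 successes in n = 10 trials, for which the maximum is at 0.3.

```python
def newton_raphson(d1, d2, start, iterations=6):
    """Apply beta <- beta - L'(beta)/L''(beta) repeatedly and return every iterate."""
    path = [start]
    for _ in range(iterations):
        beta = path[-1]
        path.append(beta - d1(beta) / d2(beta))
    return path

# Binomial log-likelihood L(p) = y log(p) + (n - y) log(1 - p), up to a constant.
y, n = 3, 10

def d1(p):
    """First derivative L'(p)."""
    return y / p - (n - y) / (1 - p)

def d2(p):
    """Second derivative L''(p)."""
    return -y / p ** 2 - (n - y) / (1 - p) ** 2

for start in (0.1, 0.5, 0.9):
    print(start, [round(p, 6) for p in newton_raphson(d1, d2, start)])
```

The sketch does no safeguarding: a starting value of exactly 0 or 1 causes a division by zero, and nothing prevents an iterate from leaving the interval (0, 1) for other likelihoods.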

4.31 For $n$ independent observations from a Poisson distribution, show that Fisher scoring
gives $\mu^{(t+1)} = \bar{y}$ for all $t > 0$. By contrast, what happens with Newton-Raphson?

4.32 Write a computer program using the Newton-Raphson algorithm to maximize the
likelihood for a binomial sample. For $\hat{\pi} = 0.3$ based on $n = 10$, print out results
of the first six iterations when the starting value $\pi^{(0)}$ is (a) 0.1, (b) 0.2, ..., 0.9.
Summarize the effects of the starting value on speed of convergence. What happens
if the starting value is 0 or 1?

4.33 For noncanonical links in a GLM, show that the observed information matrix depends
on  the  data  and  hence  differs  from  the  expected  information.  Illustrate  using  the  probit 
model. 


4.34 Suppose $y_{i1}, y_{i2}, \ldots, y_{in_i}$ are responses on a binary survey question for $n_i$ members
of a family, with $P(Y_{ij} = 1) = \pi = 1 - P(Y_{ij} = 0)$, $j = 1, \ldots, n_i$, $i = 1, \ldots, N$.

a. Overdispersion: In each family, suppose everyone sees the response by the “head
of household” and then makes the same response. State the distribution of $\sum_j Y_{ij}$
and compare its variance to that of the binomial.

b. Underdispersion: In each family, suppose member $j$ hears the response of member
$j - 1$ and makes the opposite response, $j = 2, \ldots, n_i$, with $n_i$ an even number.
State the distribution of $\sum_j Y_{ij}$ and compare its variance to that of the binomial.

4.35  Sometimes,  sample  proportions  are  continuous  rather  than  of  the  binomial  form 
(number  of  successes)/(number  of  trials).  Each  observation  is  any  real  number 
between  0  and  1,  such  as  the  proportion  of  a  tooth  surface  that  is  covered  with 
plaque. For independent responses $\{y_i\}$, Aitchison and Shen (1980), Bartlett (1937),
and Lesaffre et al. (2007) modeled logit($Y_i$) ~ $N(\beta_i, \sigma^2)$. Then $Y_i$ itself has a logit-normal
distribution (Section 1.6.2).

a. Expressing a $N(\beta_i, \sigma^2)$ variate as $\beta_i + \sigma Z$, where $Z$ is standard normal, show that
$Y_i = \exp(\beta_i + \sigma Z)/[1 + \exp(\beta_i + \sigma Z)]$.

b. Show that for small $\sigma$,

$$Y_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}} + \frac{e^{\beta_i}}{1 + e^{\beta_i}} \cdot \frac{1}{1 + e^{\beta_i}}\, \sigma Z + \frac{e^{\beta_i}(1 - e^{\beta_i})}{2(1 + e^{\beta_i})^3}\, \sigma^2 Z^2 + \cdots.$$

c. Letting $\mu_i = e^{\beta_i}/(1 + e^{\beta_i})$, when $\sigma$ is close to 0 show that

$$E(Y_i) \approx \mu_i, \qquad \mathrm{var}(Y_i) \approx [\mu_i(1 - \mu_i)]^2 \sigma^2.$$

d. For independent continuous proportions $\{y_i\}$, let $\mu_i = E(Y_i)$. For a GLM, it
is sensible to use an inverse cdf link for $\mu_i$, but it is unclear how to choose
a distribution for $Y_i$. The approximate moments for the logit-normal motivate
a quasi-likelihood approach (Wedderburn 1974) with variance function $v(\mu_i) =
\phi[\mu_i(1 - \mu_i)]^2$ for unknown $\phi$. Explain why this provides similar results as fitting
a normal regression model to the sample logits assuming constant variance. (The
QL approach has the advantage of not requiring adjustment of 0 or 1 observations,
for which sample logits do not exist.)

e. Wedderburn (1974) modeled data on the proportion of a leaf showing a type
of blotch. Envision an approximation of binomial form based on cutting each
leaf into a large number of small regions of the same size and observing for
each region whether it is mostly covered with blotch. Explain why this suggests
that $v(\mu_i) = \phi\mu_i(1 - \mu_i)$. What violation of the binomial assumptions might
make this questionable? [The parametric family of beta distributions has variance
function of this form (see Section 14.3.1 and Cox 1996).]


CHAPTER  5 


Logistic  Regression 


In  introducing  generalized  linear  models  for  binary  data  in  Chapter  4  we  highlighted  logistic 
regression.  This  is  the  most  important  model  for  categorical  response  data,  being  commonly 
used  for  a  wide  variety  of  applications. 

Early  uses  of  logistic  regression  were  in  biomedical  studies,  for  instance,  to  model 
whether  subjects  have  a  particular  condition  such  as  lung  cancer.  The  past  25  years  have 
seen  much  use  in  social  science  research,  for  modeling  opinions  and  behavior  decisions, 
and in business applications. In credit scoring, logistic regression is used to model the
probability that a subject is creditworthy. For instance, the probability that a subject pays
a  bill  on  time  may  use  predictors  such  as  the  size  of  the  bill,  annual  income,  occupation, 
mortgage  and  debt  obligations,  percentage  of  bills  paid  on  time  in  the  past,  and  other  aspects 
of  an  applicant’s  credit  history.  Another  area  of  increasing  application  is  genetics,  such  as  to 
estimate  quantitative  trait  loci  effects  by  modeling  the  probability  that  an  offspring  inherits 
an  allele  of  one  type  instead  of  another  type  as  a  function  of  phenotypic  values  on  various 
traits  for  that  offspring. 

In this chapter we study logistic regression more closely. Section 5.1 discusses parameter
interpretation.  In  Section  5.2  we  present  inferential  methods  for  those  parameters.  Sections 
5.3  and  5.4  generalize  the  model  to  multiple  predictors,  which  may  be  quantitative  and/or 
qualitative.  Finally,  in  Section  5.5  we  apply  GLM  fitting  methods  to  specify  and  solve 
likelihood  equations  for  logistic  regression. 

5.1  INTERPRETING  PARAMETERS  IN  LOGISTIC  REGRESSION 


For a binary response variable $Y$ and an explanatory variable $X$, let $\pi(x) = P(Y = 1 \mid X =
x) = 1 - P(Y = 0 \mid X = x)$. The logistic regression model is

$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}. \tag{5.1}$$

Equivalently, the logit (log odds) has the linear relationship

$$\mathrm{logit}[\pi(x)] = \log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x. \tag{5.2}$$

5.1.1  Interpreting $\beta$: Odds, Probabilities, and Linear Approximations

How can we interpret $\beta$ in (5.2)? Its sign determines whether $\pi(x)$ is increasing or decreasing
as $x$ increases. The rate of climb or descent increases as $|\beta|$ increases; as $\beta \to 0$ the curve
flattens to a horizontal straight line. When $\beta = 0$, $Y$ is independent of $X$. For quantitative $x$
with $\beta > 0$, the curve for $\pi(x)$ has the shape of the cdf of the logistic distribution (recall
Section 4.2.5). Since the logistic density is symmetric, $\pi(x)$ approaches 1 at the same rate
that it approaches 0.

Exponentiating both sides of (5.2) shows that the odds are an exponential function of $x$.
This provides a basic interpretation for the magnitude of $\beta$: The odds multiply by $e^{\beta}$ for
every 1-unit increase in $x$. In other words, $e^{\beta}$ is an odds ratio, the odds at $X = x + 1$ divided
by the odds at $X = x$.
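
The odds-ratio interpretation can be checked numerically. The sketch below uses hypothetical parameter values (they are not estimates from any data set in the text) and confirms that the ratio of odds at x + 1 and x is the same at every x.

```python
import math

def logistic_prob(alpha, beta, x):
    """pi(x) from the logistic regression formula (5.1)."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

# Hypothetical parameter values chosen only for illustration.
alpha, beta = -3.0, 0.5

odds_ratios = []
for x in (0.0, 2.0, 7.5):
    ratio = odds(logistic_prob(alpha, beta, x + 1)) / odds(logistic_prob(alpha, beta, x))
    odds_ratios.append(ratio)
    print(x, ratio, math.exp(beta))
```

The probabilities themselves change by different amounts at the three x values; only the odds ratio is constant.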

Many scientists are not familiar with odds or logits, so the interpretation of a multiplicative
effect of $e^{\beta}$ on the odds scale or an additive effect of $\beta$ on the logit scale is not helpful
to them. A simpler slope interpretation uses a linearization argument (Berkson 1951). Since
it has a curved rather than a linear appearance, the logistic regression function (5.1) implies
that the rate of change in $\pi(x)$ per unit change in $x$ varies. A straight line drawn tangent to
the curve at a particular $x$ value, shown in Figure 5.1, describes the instantaneous rate of
change at that point. Calculating $\partial\pi(x)/\partial x$ with (5.1) yields a fairly complex function of
the parameters and $x$, but it simplifies to the form $\beta\pi(x)[1 - \pi(x)]$.
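
As an informal check of the tangent-slope formula, the following sketch compares $\beta\pi(x)[1 - \pi(x)]$ with a central finite-difference approximation to the derivative of (5.1). The parameter values and the step size `h` are arbitrary choices made for illustration.

```python
import math

def pi_of_x(x, alpha, beta):
    """Logistic regression probability (5.1), written in the equivalent form 1/(1 + e^-eta)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def tangent_slope(x, alpha, beta):
    """Instantaneous rate of change beta * pi(x) * [1 - pi(x)]."""
    p = pi_of_x(x, alpha, beta)
    return beta * p * (1.0 - p)

# Hypothetical parameter values; pi(x) = 1/2 at x = -alpha/beta = 6.
alpha, beta = -3.0, 0.5
h = 1e-6  # step for the finite-difference approximation

comparisons = []
for x in (2.0, 6.0, 10.0):
    numeric = (pi_of_x(x + h, alpha, beta) - pi_of_x(x - h, alpha, beta)) / (2 * h)
    comparisons.append((numeric, tangent_slope(x, alpha, beta)))
    print(x, numeric, tangent_slope(x, alpha, beta))
```

At x = 6 the slope equals $\beta/4$ = 0.125, the steepest value on this curve.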

For instance, the line tangent to the curve at $x$ for which $\pi(x) = \frac{1}{2}$ has slope $\beta(\frac{1}{2})(\frac{1}{2}) =
\beta/4$; when $\pi(x) = 0.9$ or 0.1, it has slope $0.09\beta$. The slope approaches 0 as $\pi(x)$ approaches
1.0 or 0. The steepest slope occurs at $x$ for which $\pi(x) = \frac{1}{2}$; that $x$ value is $x = -\alpha/\beta$. [To
check that $\pi(x) = \frac{1}{2}$ at this point, substitute $-\alpha/\beta$ for $x$ in (5.1), or substitute $\pi(x) = \frac{1}{2}$
in (5.2) and solve for $x$.] This $x$ value is sometimes called the median effective level. In
toxicology studies it is called LD50 (LD = lethal dose), the dose with a 50% chance of a
lethal result.

Figure 5.1 Linear approximation to logistic regression curve.

From this linear approximation, near $x$ where $\pi(x) = \frac{1}{2}$, a change in $x$ of $1/\beta$ corresponds
to a change in $\pi(x)$ of roughly $(1/\beta)(\beta/4) = \frac{1}{4}$; that is, $1/\beta$ approximates the distance
between $x$ values where $\pi(x) = 0.50$ and where $\pi(x) = 0.25$ or 0.75 (in reality, 0.27 and
0.73). The linear approximation works better for smaller changes in $x$, however. Since the
rate of change varies according to the value of $x$, a summary of them is the average of
$\beta\pi(x_i)[1 - \pi(x_i)]$ for the subjects in the sample.
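
The summaries in this passage can be illustrated with a short calculation. In the sketch below the parameter values and the sample of x values are hypothetical; it locates the median effective level, evaluates $\pi(x)$ at a distance $1/\beta$ on either side of it, and averages the rates of change over the sample.

```python
import math

def expit(eta):
    """Inverse logit, e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter values and predictor values, for illustration only.
alpha, beta = -3.0, 0.5
xs = [2.0, 4.5, 5.0, 6.0, 7.5, 9.0]

# Median effective level: the x at which pi(x) = 1/2.
x_median = -alpha / beta

# pi(x) one step of 1/beta below and above the median effective level.
lower = expit(alpha + beta * (x_median - 1 / beta))
upper = expit(alpha + beta * (x_median + 1 / beta))

# Average of beta * pi(x_i) * [1 - pi(x_i)] over the sample.
slopes = [beta * expit(alpha + beta * x) * (1 - expit(alpha + beta * x)) for x in xs]
avg_slope = sum(slopes) / len(slopes)

print("median effective level:", x_median)
print("pi at -1/beta and +1/beta:", round(lower, 4), round(upper, 4))
print("average rate of change:", avg_slope)
```

The two probabilities are about 0.27 and 0.73 rather than the 0.25 and 0.75 implied by the tangent line, and the average slope is necessarily below the maximum $\beta/4$.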

An alternative way to interpret the effect reports the values of $\pi(x)$ at certain $x$ values,
such as at the minimum and maximum values. To do this, we substitute the values for $x$
into formula (5.1) for $\pi(x)$. It is more resistant to outliers on $x$ to report the $\pi(x)$ values
at the quartiles of $x$ than at the extremes. The change in $\pi(x)$ over the middle half of $x$
values, from the lower quartile to the upper quartile, is a useful summary of the effect. It
can be compared with the corresponding change over the middle half of values of other
quantitative predictors.

The intercept parameter $\alpha$ is not usually of particular interest. However, by centering
the predictor about 0 [i.e., replacing x by $(x - \bar{x})$], $\alpha$ becomes the logit at
$x = \bar{x}$, and thus $e^{\alpha}/(1 + e^{\alpha}) = \pi(\bar{x})$. As in ordinary regression,
centering is also helpful in complex models containing quadratic or interaction terms to reduce
correlations among model parameter estimates.

5.1.2  Looking  at  the  Data 

In practice, these interpretations use formula (5.1) with ML estimates substituted for
parameters. Before fitting the model and making such interpretations, look at the data to check
that the logistic regression model is appropriate. Since y takes only values 0 and 1, it is
difficult to check this by an ordinary scatterplot of observed (x, y) values.

It can be helpful to plot sample proportions or logits against x. Let $n_i$ denote the number of
observations at setting i of x. Of them, let $y_i$ denote the number of "1" outcomes, with
$p_i = y_i/n_i$. Sample logit i (also called empirical logit) is
$\log[p_i/(1 - p_i)] = \log[y_i/(n_i - y_i)]$. The scatterplot of sample logits should be roughly
linear. The sample logit is not finite when $y_i = 0$ or $n_i$. An ad hoc adjustment adds a
positive constant to the number of outcomes of the two types. The adjustment

$$\log\left[\frac{y_i + \frac{1}{2}}{n_i - y_i + \frac{1}{2}}\right]$$

is the least-biased estimator of this form for the true logit (see Note 5.2).
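
As a small illustration of the adjusted sample logit, here is a sketch under stated assumptions: the function name `empirical_logit` and the grouped counts are hypothetical, not data from the text.

```python
import math

def empirical_logit(y, n, c=0.5):
    """Adjusted sample logit log[(y + c) / (n - y + c)].

    With c = 0 this is the raw sample logit, which is undefined when y = 0 or y = n.
    """
    return math.log((y + c) / (n - y + c))

# Hypothetical grouped data: (number of "1" outcomes, number of observations)
# at four settings of x.
counts = [(0, 5), (2, 6), (4, 7), (6, 6)]
logits = [empirical_logit(y, n) for y, n in counts]

for (y, n), value in zip(counts, logits):
    print(y, n, round(value, 3))
```

The first and last settings have y = 0 and y = n, where the unadjusted logit would not be finite; with the added constant every setting yields a plottable value.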

When x is continuous and all $n_i = 1$, or when x is essentially continuous and all $n_i$ are
small, this is unsatisfactory. We could group the data with nearby x values into categories
before calculating sample proportions and sample logits. A better approach that does not
require choosing arbitrary categories uses a smoothing mechanism to reveal trends. One such
smoothing approach fits a generalized additive model (to be introduced in Section 7.4.9),
which replaces the linear predictor of a GLM by a smooth function. A plot of this fit
reveals whether severe discrepancies occur from the S-shaped trend predicted by logistic
regression.

5.1.3 Example: Horseshoe Crab Mating Revisited

To illustrate logistic regression, we reanalyze the horseshoe crab data introduced in
Section 4.3.2. The binary response is whether a female crab has any male crabs residing nearby
(satellites). For crab i, let $y_i = 1$ if she has at least one satellite and $y_i = 0$ if she has
none. Here, we use as a predictor the female crab's carapace width.

Figure 5.2 plots the data against x = width. The scatterplot consists of a set of points
with $y_i = 1$ and a second set of points with $y_i = 0$. The numbered symbols indicate the
number of observations at each point. It appears that $y_i = 1$ tends to occur relatively more
often at higher x values; in fact, all crabs with width > 29 cm have satellites. The positive
effect  of  width  is  also  suggested  by  the  grouping  of  the  data  used  to  investigate  adequacy  of 
Poisson  regression  models  in  Section  4.3.3  (Table  4.4).  In  each  of  the  eight  width  categories, 
we  computed  the  sample  proportion  of  crabs  having  satellites  and  the  mean  width  for  the 
crabs  in  that  category.  Figure  5.2  shows  eight  dots  representing  the  sample  proportions 
of  female  crabs  having  satellites  plotted  against  the  mean  widths  for  the  eight  categories. 
Figure  5.2  also  shows  a  curve  based  on  smoothing  the  data  using  the  generalized  additive 
modeling  method,  assuming  a  binomial  response  and  logit  link.  This  curve  shows  a  roughly 
increasing  trend  and  is  more  informative  than  viewing  the  binary  data  alone.  It  suggests  that 
an  S-shaped  regression  function  may  describe  this  relationship  relatively  well.  Since  the 
eight  plotted  sample  proportions  and  the  GAM  smoothing  curve  both  suggest  an  increasing 
trend,  we  proceed  with  fitting  the  logistic  regression  model  with  linear  width  predictor. 


Figure 5.2 Whether satellites are present (1 = yes, 0 = no) by width of female crab, with smoothing fit of
generalized additive model.

Table 5.1 Output (Based on SAS) for Logistic Regression Model with Horseshoe Crab Data

Criteria For Assessing Goodness Of Fit

Criterion             DF     Value
Deviance              171    194.4527
Pearson Chi-Square    171    165.1434
Log Likelihood               -97.2263

                        Std      Likelihood-Ratio       Wald
Parameter   Estimate    Error    95% Conf Limits        Chi-Sq   P>ChiSq
Intercept   -12.3508    2.6287   -17.8097   -7.4573     22.07    <.0001
width         0.4972    0.1017     0.3084    0.7090     23.89    <.0001

We defer to Section 5.5 the details about ML fitting. Software (see the text website)
reports output such as Table 5.1 exhibits. Let $\pi(x)$ denote the probability that a female
horseshoe crab of width x has a satellite. The ML fit is

$$\hat{\pi}(x) = \frac{\exp(-12.351 + 0.497x)}{1 + \exp(-12.351 + 0.497x)}.$$

Substituting x = 26.3 cm, the mean width level in this sample, $\hat{\pi}(x) = 0.674$. The estimated
probability equals $\frac{1}{2}$ when $x = -\hat{\alpha}/\hat{\beta} = 12.351/0.497 = 24.8$.

Figure 5.3 plots $\hat{\pi}(x)$ from the logistic fit against width, again superimposing the sample
proportions that we viewed in Figure 5.2. The curve seems to follow reasonably well the
trend in those proportions.

Figure 5.3 Logistic regression fitted curve and sample proportions of satellites, by width of female crab.

The estimated odds of a satellite multiply by $\exp(\hat{\beta}) = \exp(0.497) = 1.64$ for each
1-cm increase in width; that is, there is a 64% increase. To convey the effect less
technically, we could report the incremental rate of change in the probability of a satellite.
At the mean width, $\hat{\pi}(x) = 0.674$, and $\hat{\pi}(x)$ increases by about
$\hat{\beta}[\hat{\pi}(x)(1 - \hat{\pi}(x))] = 0.497(0.674)(0.326) = 0.11$ for a 1-cm increase in
width. Or, we could report $\hat{\pi}(x)$ at the quartiles of x. The lower quartile, median, and
upper quartile for width are 24.9, 26.1, and 27.7; $\hat{\pi}(x)$ at those values equals 0.51, 0.65,
and 0.81, increasing by 0.30 over the x values for the middle half of the sample.
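
The summaries above can be reproduced from the rounded estimates reported in the text. This is a minimal sketch, not the software output itself; the helper names `expit` and `pi_hat` are assumptions, and because the estimates are rounded to three decimals the results can differ from the text in the third decimal place.

```python
import math

def expit(t):
    """Inverse logit: exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Rounded ML estimates reported in the text for x = carapace width (cm).
alpha_hat, beta_hat = -12.351, 0.497

def pi_hat(x):
    """Fitted probability of at least one satellite at width x."""
    return expit(alpha_hat + beta_hat * x)

odds_multiplier = math.exp(beta_hat)      # effect on the odds per 1-cm increase
x_median_effect = -alpha_hat / beta_hat   # width at which the fitted probability is 1/2
p_mean = pi_hat(26.3)                     # fitted probability at the mean width
slope_mean = beta_hat * p_mean * (1.0 - p_mean)  # approximate change per 1 cm

print(round(odds_multiplier, 2), round(slope_mean, 2))
print([round(pi_hat(q), 2) for q in (24.9, 26.1, 27.7)])
```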

The latter summary is useful for comparing the effects of predictors having different
units. For instance, with the female crab's weight as the predictor,
$\mathrm{logit}[\hat{\pi}(x)] = -3.695 + 1.815x$. A 1-kg increase in weight is not comparable to a
1-cm increase in width, so $\hat{\beta} = 1.815$ for x = weight is not comparable to
$\hat{\beta} = 0.497$ for x = width. The quartiles for weight are 2.00, 2.35, and 2.85;
$\hat{\pi}(x)$ at those values equals 0.48, 0.64, and 0.81, increasing by 0.33 over the middle half
of the sampled weights. The effect is similar to that of width, which is not surprising as these
predictors are very highly correlated.
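
The comparison of the two predictors on a common footing can be sketched as follows. The function name `iqr_change` is an assumption; the estimates and quartiles are the rounded values quoted in the text.

```python
import math

def expit(t):
    """Inverse logit: exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))

def iqr_change(alpha_hat, beta_hat, q1, q3):
    """Change in fitted probability from the lower to the upper quartile of x."""
    return expit(alpha_hat + beta_hat * q3) - expit(alpha_hat + beta_hat * q1)

# Width (cm) and weight (kg) fits, with their sample quartiles, as quoted in the text.
width_change = iqr_change(-12.351, 0.497, 24.9, 27.7)
weight_change = iqr_change(-3.695, 1.815, 2.00, 2.85)

print(round(width_change, 2), round(weight_change, 2))
```

Although the two slope estimates differ by a factor of more than three, the change in fitted probability over the middle half of each predictor is nearly the same, which is the point of using a unit-free summary.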


5.1.4  Logistic  Regression  with  Retrospective  Studies 

Another  property  of  logistic  regression  relates  to  situations  in  which  the  explanatory  variable 
X  rather  than  the  response  variable  Y  is  random.  This  occurs  with  retrospective  sampling 
designs, such as case-control biomedical studies. For samples of subjects having y = 1
(cases)  and  having  y  =  0  (controls),  the  value  of  X  is  observed.  Evidence  exists  of  an 
association  if  the  distribution  of  X  values  differs  between  cases  and  controls.  In  retrospective 
studies,  we  can  estimate  odds  ratios.  Effects  in  the  logistic  regression  model  refer  to  odds 
ratios.  We  can  fit  such  models  and  estimate  effects  in  case-control  studies  (Breslow  and 
Powers  1978,  Prentice  and  Pyke  1979). 

Here is a justification for this. Let Z indicate whether a subject is sampled (1 = yes,
0 = no). Let $\rho_1 = P(Z = 1 \mid y = 1)$ denote the probability of sampling a case, and let
$\rho_0 = P(Z = 1 \mid y = 0)$ denote the probability of sampling a control. Even though the
conditional distribution of Y given X = x is not sampled, we need a model for
$P(Y = 1 \mid z = 1, x)$, assuming that $P(Y = 1 \mid x)$ follows the logistic model. By Bayes'
theorem,

$$P(Y = 1 \mid z = 1, x) = \frac{P(Z = 1 \mid y = 1, x)\,P(Y = 1 \mid x)}{\sum_{j=0}^{1}\left[P(Z = 1 \mid y = j, x)\,P(Y = j \mid x)\right]}. \qquad (5.3)$$

Now, suppose that $P(Z = 1 \mid y, x) = P(Z = 1 \mid y)$ for y = 0 and 1; that is, for each y, the
sampling probabilities do not depend on x. For instance, often x refers to exposure of
some type, such as whether someone has been a smoker. Then, for cases and for controls,
the probability of being sampled is the same for smokers and nonsmokers. Under this
assumption, substituting $\rho_1$ and $\rho_0$ in (5.3) and dividing numerator and denominator by
$P(Y = 0 \mid x)$, we get

$$P(Y = 1 \mid z = 1, x) = \frac{\rho_1 \exp(\alpha + \beta x)}{\rho_0 + \rho_1 \exp(\alpha + \beta x)}.$$

Then, dividing numerator and denominator by $\rho_0$ and using
$\rho_1/\rho_0 = \exp[\log(\rho_1/\rho_0)]$ yields

$$\mathrm{logit}[P(Y = 1 \mid z = 1, x)] = \alpha^* + \beta x$$

with $\alpha^* = \alpha + \log(\rho_1/\rho_0)$. The logistic regression model holds with the same
effect parameter $\beta$ as in the model for $P(Y = 1 \mid x)$. If the sampling rate for cases is
greater than that for controls, the intercept estimated is larger than the one estimated with a
prospective study.
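
A numerical check of this result may help. The sketch below assumes hypothetical values for the population parameters and the two sampling probabilities (none of them come from the text) and verifies that the logit of the Bayes-theorem probability is linear in x with the same slope and a shifted intercept.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical values, for illustration only.
alpha, beta = -4.0, 0.8     # population model: logit P(Y=1 | x) = alpha + beta * x
rho1, rho0 = 0.90, 0.05     # sampling probabilities for cases and for controls

def p_case_given_sampled(x):
    """P(Y=1 | z=1, x) by Bayes' theorem, with sampling independent of x given y."""
    p1 = expit(alpha + beta * x)
    return rho1 * p1 / (rho1 * p1 + rho0 * (1.0 - p1))

alpha_star = alpha + math.log(rho1 / rho0)   # shifted intercept
xs = [2.0, 4.0, 6.0]
sampled_logits = [logit(p_case_given_sampled(x)) for x in xs]

for x, value in zip(xs, sampled_logits):
    print(x, round(value, 4), round(alpha_star + beta * x, 4))
```

Each printed pair agrees, so only the intercept changes under this kind of outcome-dependent sampling.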

With case-control studies, it is not possible to estimate $\beta$ in binary-response models with
links  other  than  the  logit.  Unlike  the  odds  ratio,  the  effect  for  the  conditional  distribution  of 
X  given  y  does  not  then  equal  that  for  Y  given  x.  This  is  an  important  advantage  of  the  logit 
link  and  is  one  reason  why  logistic  regression  models  are  so  popular  in  biomedical  studies. 

Many  case-control  studies  employ  matching.  Each  case  is  matched  with  one  or  more 
control  subjects.  The  controls  are  like  the  case  on  key  characteristics  such  as  age.  The 
model and subsequent analysis should take the matching into account. In Section 11.2.5
we  discuss  logistic  regression  for  matched  case-control  studies. 


5.1.5  Logistic  Regression  Is  Implied  by  Normal  Explanatory  Variables 

Regardless of the sampling mechanism, logistic regression may or may not describe a
relationship well. In one special case, it necessarily holds. Given that Y = i, suppose that X
has a $N(\mu_i, \sigma^2)$ distribution, i = 0, 1. Then, by Bayes' theorem, $P(Y = 1 \mid X = x)$
satisfies the logistic model with $\beta = (\mu_1 - \mu_0)/\sigma^2$ (Cornfield 1962). Thus, when a
population is a mixture of two types of subjects, one type with y = 1 that is approximately
normally distributed on X and the other type with y = 0 that is approximately normal on X with
similar variance, the logistic regression function approximates well the curve for $\pi(x)$.
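
The slope formula can be verified numerically. This is a sketch under stated assumptions: the means, common standard deviation, and marginal probability `prior1` are hypothetical values chosen only for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a N(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical values: X is N(mu1, sigma^2) when Y = 1 and N(mu0, sigma^2) when Y = 0.
mu0, mu1, sigma = 24.0, 27.0, 2.0
prior1 = 0.6   # P(Y = 1)

def posterior_logit(x):
    """logit P(Y=1 | X=x) from Bayes' theorem with normal class-conditional densities."""
    num = prior1 * normal_pdf(x, mu1, sigma)
    den = (1.0 - prior1) * normal_pdf(x, mu0, sigma)
    return math.log(num / den)

# With equal variances the logit is linear in x, so any unit step gives the same slope.
slope_a = posterior_logit(26.0) - posterior_logit(25.0)
slope_b = posterior_logit(30.0) - posterior_logit(29.0)
beta_formula = (mu1 - mu0) / sigma ** 2

print(round(slope_a, 6), round(slope_b, 6), beta_formula)
```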

The result extends to a vector of explanatory variables having multivariate normal
distributions in each case (Exercise 5.30 and Section 15.1.1). If the distributions are normal
but with different variances, the model applies but with a quadratic term. In that case, the
relationship is nonmonotone, with $\pi(x)$ increasing and then decreasing, or the reverse.


5.2  INFERENCE  FOR  LOGISTIC  REGRESSION 

By standard results, ML estimators of logistic regression model parameters have
large-sample normal distributions. Inference can use the (Wald, likelihood-ratio, score) triad of
methods introduced in Section 1.3.3.


5.2.1  Inference  About  Model  Parameters  and  Probabilities 

For the logistic model with a single predictor,

$$\mathrm{logit}[\pi(x)] = \alpha + \beta x,$$

significance tests focus on $H_0$: $\beta = 0$, the hypothesis of independence. The Wald test uses
the log likelihood at $\hat{\beta}$, with test statistic $z = \hat{\beta}/SE$ or its square; under
$H_0$, $z^2$ is asymptotically $\chi^2_1$. The likelihood-ratio test uses twice the difference
between the maximized log likelihood at $\hat{\beta}$ and at $\beta = 0$ and also has an asymptotic
$\chi^2_1$ null distribution. The score test uses the log likelihood at $\beta = 0$ through the
derivative of the log likelihood (i.e., the score function) at that point. The test statistic
compares the sufficient statistic for $\beta$ to its null expected value, suitably standardized
[$N(0, 1)$ or $\chi^2_1$]. In Section 5.3.5 we present this test.

A confidence interval for $\beta$ results from inverting a test of $H_0$: $\beta = \beta_0$. The
interval is the set of $\beta_0$ for which the chi-squared test statistic is no greater than
$\chi^2_1(\alpha) = z^2_{\alpha/2}$. For the Wald approach, this means
$[(\hat{\beta} - \beta_0)/SE]^2 \le z^2_{\alpha/2}$, so the interval is
$\hat{\beta} \pm z_{\alpha/2}(SE)$.

For summarizing the relationship, other characteristics may have greater importance
than $\beta$, such as $\pi(x)$ at various x values. For fixed $x = x_0$,
$\mathrm{logit}[\hat{\pi}(x_0)] = \hat{\alpha} + \hat{\beta}x_0$ has a large-sample SE given by the
estimated square root of

$$\mathrm{var}(\hat{\alpha} + \hat{\beta}x_0) = \mathrm{var}(\hat{\alpha}) + x_0^2\,\mathrm{var}(\hat{\beta}) + 2x_0\,\mathrm{cov}(\hat{\alpha}, \hat{\beta}).$$

A 95% confidence interval for $\mathrm{logit}[\pi(x_0)]$ is
$(\hat{\alpha} + \hat{\beta}x_0) \pm 1.96(SE)$. Substituting each endpoint into the inverse
transformation $\pi(x_0) = \exp(\mathrm{logit})/[1 + \exp(\mathrm{logit})]$ gives a corresponding
interval for $\pi(x_0)$.

5.2.2  Example:  Inference  for  Horseshoe  Crab  Mating  Data 

We illustrate logistic regression inferences with the model for the probability that a
horseshoe crab has a satellite, with crab width as the predictor. Table 5.1 showed the fit and
standard errors. The statistic $z = \hat{\beta}/SE = 0.497/0.102 = 4.89$ provides strong evidence of
a positive width effect (P < 0.0001). The equivalent Wald chi-squared statistic, $z^2 = 23.89$,
has df = 1. The maximized log likelihoods equal -112.88 under $H_0$: $\beta = 0$ and -97.23
for the full model. The likelihood-ratio statistic equals $-2[-112.88 - (-97.23)] = 31.31$,
with df = 1. This provides even stronger evidence than the Wald test.

The Wald 95% confidence interval for $\beta$ is $0.497 \pm 1.96(0.102)$, or (0.298, 0.697).
Table 5.1 reports a likelihood-ratio confidence interval of (0.308, 0.709), based on the
profile likelihood function. The confidence interval for the effect on the odds per 1-cm
increase in width equals $(e^{0.308}, e^{0.709}) = (1.36, 2.03)$. We infer that a 1-cm increase in
width has at least a 36% increase and at most a doubling in the odds of a satellite.
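
These test statistics and intervals follow directly from the quantities in Table 5.1 and the two maximized log likelihoods. The sketch below uses only the Python standard library; the helper name `chi2_sf_df1` is an assumption, and it relies on the fact that a chi-squared variable with 1 df is the square of a standard normal variable.

```python
import math

def chi2_sf_df1(t):
    """P(chi-squared with df = 1 exceeds t), via the standard normal tail."""
    return math.erfc(math.sqrt(t / 2.0))

beta_hat, se = 0.4972, 0.1017                 # width row of Table 5.1
loglik_null, loglik_full = -112.88, -97.23    # maximized log likelihoods from the text

z = beta_hat / se
wald_chi2 = z ** 2
lr_stat = -2.0 * (loglik_null - loglik_full)

wald_ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
# Exponentiating the profile-likelihood limits from Table 5.1 gives the odds-scale interval.
lr_ci_odds = (math.exp(0.3084), math.exp(0.7090))

print(round(z, 2), round(wald_chi2, 2), round(lr_stat, 2))
print(chi2_sf_df1(wald_chi2) < 0.0001, chi2_sf_df1(lr_stat) < 0.0001)
```

Because the log likelihoods are rounded to two decimals, the likelihood-ratio statistic computed here can differ from the text's value in the second decimal place.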

Most software for logistic regression also can report estimates and confidence intervals
for $\pi(x)$ (for examples, see the text website). Consider this for crabs of width x = 26.5,
which is near the mean width. The estimated logit is $-12.351 + 0.497(26.5) = 0.826$, and
$\hat{\pi}(x) = 0.695$. Software reports

$$\widehat{\mathrm{var}}(\hat{\alpha}) = 6.91023, \quad \widehat{\mathrm{var}}(\hat{\beta}) = 0.01035, \quad \widehat{\mathrm{cov}}(\hat{\alpha}, \hat{\beta}) = -0.26685,$$

from which

$$\widehat{\mathrm{var}}\{\mathrm{logit}[\hat{\pi}(x)]\} = 6.91023 + x^2(0.01035) + 2x(-0.26685).$$

At x = 26.5 this is 0.0356, so the 95% confidence interval for $\mathrm{logit}[\pi(26.5)]$ equals
$0.826 \pm (1.96)\sqrt{0.0356}$, or (0.456, 1.196). This translates to the interval (0.61, 0.77) for
the probability of satellites (e.g., $\exp(0.456)/[1 + \exp(0.456)] = 0.61$). Since
$\mathrm{corr}(\hat{\alpha}, \hat{\beta})$ is near -1.0, for better computational precision, fit the
model using predictor $x^* = x - 26.5$, so that $\hat{\alpha}$ and its SE are the estimated logit and
its SE. Figure 5.4 plots the confidence bands around the prediction equation for $\pi(x)$ as a
function of x. Hauck (1983) gave alternative bands for which the confidence coefficient applies
simultaneously to all possible predictor values.
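
The interval at x = 26.5 can be recomputed from the reported variances and covariance. This is a minimal sketch; the function name `logit_ci` is an assumption, and because the inputs are rounded the results agree with the text only to about two decimals.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

alpha_hat, beta_hat = -12.3508, 0.4972               # estimates from Table 5.1
var_a, var_b, cov_ab = 6.91023, 0.01035, -0.26685    # values the text says software reports

def logit_ci(x0, z=1.96):
    """Large-sample confidence interval for the logit at x0."""
    est = alpha_hat + beta_hat * x0
    var = var_a + x0 ** 2 * var_b + 2.0 * x0 * cov_ab
    half = z * math.sqrt(var)
    return est - half, est + half

lo, hi = logit_ci(26.5)
prob_ci = (expit(lo), expit(hi))                  # back-transform each endpoint
corr_ab = cov_ab / math.sqrt(var_a * var_b)       # correlation of the two estimates

print(round(prob_ci[0], 2), round(prob_ci[1], 2))
print(round(corr_ab, 3))
```

The variance at x = 26.5 is a small difference of large terms, which is why the strong correlation between the intercept and slope estimates motivates centering the predictor before fitting.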

Figure 5.4 Prediction equation and 95% confidence bands (from SAS PROC LOGISTIC) for probability of
satellite as a function of width.

We could ignore the model fit and simply use sample proportions (i.e., the saturated
model) to estimate such probabilities. Six female crabs in the sample had x = 26.5, and
four of them had satellites. The sample proportion estimate at x = 26.5 is
$\hat{\pi} = 4/6 = 0.67$, similar to the model-based estimate. The 95% score confidence interval
(Section 1.4.2) based on these six observations alone equals (0.30, 0.90).

When the logistic regression model truly holds, the model-based estimator of a probability
is considerably better than the sample proportion. The model has only two parameters to
estimate, whereas the saturated model has a separate parameter for every distinct value of x.
For instance, at x = 26.5, software reports SE = 0.04 for the model-based estimate 0.695,
whereas the SE is $\sqrt{\hat{\pi}(1 - \hat{\pi})/n} = \sqrt{(0.67)(0.33)/6} = 0.19$ for the sample
proportion of 0.67 with only 6 observations. The 95% confidence intervals are (0.61, 0.77) using
the model versus (0.30, 0.90) using the sample proportion. Instead of using only 6 observations,
the model uses the information that all 173 observations provide in estimating the
two model parameters. The result is a much more precise estimate.
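
The two figures for the sample proportion can be reproduced as follows. This sketch assumes the score interval of Section 1.4.2 is the usual Wilson form for a binomial proportion; the function name `wilson_interval` is an assumption.

```python
import math

def wilson_interval(y, n, z=1.96):
    """Score (Wilson) confidence interval for a binomial proportion."""
    p = y / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / denom
    return center - half, center + half

y, n = 4, 6                                   # crabs with satellites among those with x = 26.5
p = y / n
se_prop = math.sqrt(p * (1.0 - p) / n)        # SE of the sample proportion
score_ci = wilson_interval(y, n)

print(round(se_prop, 2), tuple(round(v, 2) for v in score_ci))
```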

Reality is a bit more complicated. In practice, the model is not exactly the true relationship
between $\pi(x)$ and x. However, if it approximates the true probabilities decently, its estimator
still tends to lie closer than the sample proportion to the true value. The model smooths
the sample data, somewhat dampening the observed variability. The resulting estimators
tend to be better unless each sample proportion is based on an extremely large sample.
Section 5.3.10 discusses this advantage of using models.


5.2.3  Checking  Goodness  of  Fit:  Grouped  and  Ungrouped  Data 

In practice, there is no guarantee that a certain logistic regression model fits the data well. For
any type of binary data, one way to detect lack of fit uses a likelihood-ratio test to compare
the model to more complex ones. A more complex model might contain a nonlinear effect.
Models with multiple predictors would consider interaction. If more complex models do
not fit better, this provides some assurance that the model chosen is reasonable.

Other approaches to detecting lack of fit search for any way that the model fails. This
is simplest when the explanatory variables are solely categorical, as we'll illustrate in
Section 5.4.3. At each setting of x, multiplying the estimated probabilities of the two
outcomes by the number of subjects at that setting yields estimated expected frequencies
for y = 0 and y = 1. These are fitted values. The test of the model compares the observed
counts and fitted values using a Pearson $X^2$ or likelihood-ratio $G^2$ statistic. For a fixed
number of settings, as the fitted counts increase, $X^2$ and $G^2$ have limiting chi-squared null
distributions. The degrees of freedom, called the residual df for the model, subtract the
number of parameters in the model from the number of parameters in the saturated model
(i.e., the number of settings of x).

The  reason  for  the  restriction  to  categorical  predictors  for  a  global  test  of  fit  relates  to 
the  distinction  that  we  mentioned  in  Section  4.5.3  between  grouped  and  ungrouped  data  for 
binomial  models.  The  saturated  model  differs  in  the  two  cases.  An  asymptotic  chi-squared 
distribution for the deviance results as $n \to \infty$ with a fixed number of parameters in that
model  and  hence  a  fixed  number  of  settings  of  predictor  values  (i.e.,  grouped  data). 

5.2.4  Example:  Model  Goodness  of  Fit  for  Horseshoe  Crab  Data 

We illustrate with a goodness-of-fit analysis for the model using x = width to predict the
probability that a female crab has a satellite. One way to check it compares it to a more
complex model, such as the model containing a quadratic term or linear spline. With width
centered at 0 by subtracting its mean of 26.3, the quadratic model has fit

$$\mathrm{logit}[\hat{\pi}(x)] = 0.618 + 0.533(x - \bar{x}) + 0.040(x - \bar{x})^2.$$

The quadratic estimate has SE = 0.046. There is not much evidence to support adding that
term. The likelihood-ratio statistic for testing that the true coefficient of $x^2$ is 0 equals 0.83
(df = 1).
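
To see how weak this evidence is, the reported statistics can be converted to approximate P-values. This is a sketch using only the standard library; the helper name `chi2_sf_df1` is an assumption, and the P-values are not stated in the text.

```python
import math

def chi2_sf_df1(t):
    """P(chi-squared with df = 1 exceeds t), via the standard normal tail."""
    return math.erfc(math.sqrt(t / 2.0))

wald_z = 0.040 / 0.046             # quadratic estimate divided by its SE
p_wald = chi2_sf_df1(wald_z ** 2)  # two-sided Wald P-value
p_lr = chi2_sf_df1(0.83)           # P-value for the likelihood-ratio statistic

print(round(wald_z, 2), round(p_wald, 2), round(p_lr, 2))
```

Both P-values are far above conventional significance levels, consistent with the conclusion that the quadratic term is not needed.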

We next evaluate overall goodness of fit. Width takes 66 distinct values for the 173
crabs, with few observations at most widths. We can view the data as a 66 × 2 contingency
table. The two cells in each row count the number of crabs with satellites and the number
of crabs without satellites, at that width. The chi-squared theory for $X^2$ and $G^2$ applies
when the number of levels of x is fixed, and the number of observations at each level grows.
Although we grouped the data using the distinct width values rather than using 173 separate
binary responses, this theory is violated here in two ways. First, most fitted counts are very
small. Second, when more data are collected, additional width values would occur, so the
contingency table would contain more cells rather than a fixed number. Because of this,
$X^2$ and $G^2$ for logistic regression models with continuous or nearly continuous predictors
do not have approximate chi-squared distributions. Normal approximations can be more
appropriate (see Section 10.6.4 for references), but no such method has become as popular
as methods presented next.


5.2.5  Checking  Goodness  of  Fit  with  Ungrouped  Data  by  Grouping 

As just noted, with ungrouped data or with continuous or nearly continuous predictors, $X^2$
and $G^2$ do not have limiting chi-squared distributions. They are still useful for comparing
models, as done above for checking a quadratic term. Also, we can apply them in an
approximate manner to grouped observed and fitted values for a partition of the space of x
values.

Table 5.2 Grouping of Observed and Fitted Values for Fit of
Logistic Regression Model to Horseshoe Crab Data

Width (cm)     Number Yes   Number No   Fitted Yes   Fitted No
<23.25              5            9          3.64       10.36
23.25-24.25         4           10          5.31        8.69
24.25-25.25        17           11         13.78       14.22
25.25-26.25        21           18         24.23       14.77
26.25-27.25        15            7         15.94        6.06
27.25-28.25        20            4         19.38        4.62
28.25-29.25        15            3         15.65        2.35
>29.25             14            0         13.08        0.92

Table 5.2 uses the groupings of Table 4.4, giving an 8 × 2 table. In each width category,
the fitted value for a "yes" response is the sum of the estimated probabilities $\hat{\pi}(x)$ for
all crabs having width in that category; the fitted value for a "no" response is the sum
of $1 - \hat{\pi}(x)$ for those crabs. The fitted values are then much larger. Then, $X^2$ and $G^2$
have better validity, although the chi-squared theory still is not perfect because $\pi(x)$ is
not constant in each category. Their values are $X^2 = 5.3$ and $G^2 = 6.2$. Table 5.2 has
eight binomial samples, one for each width setting; the model has two parameters, so
df = 8 - 2 = 6. Neither $X^2$ nor $G^2$ shows evidence of lack of fit (P > 0.4). Thus, we can
feel more comfortable about using the model for the original ungrouped data.
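
Both statistics can be recomputed from the observed and fitted counts in Table 5.2. The sketch below uses only the standard library; the function names are assumptions, and the chi-squared tail probability uses the closed form available for an even number of degrees of freedom.

```python
import math

# Rows of Table 5.2: (observed yes, observed no, fitted yes, fitted no).
rows = [
    (5, 9, 3.64, 10.36),
    (4, 10, 5.31, 8.69),
    (17, 11, 13.78, 14.22),
    (21, 18, 24.23, 14.77),
    (15, 7, 15.94, 6.06),
    (20, 4, 19.38, 4.62),
    (15, 3, 15.65, 2.35),
    (14, 0, 13.08, 0.92),
]

def cells(table):
    """Yield (observed, fitted) for every cell of the grouped table."""
    for obs_yes, obs_no, fit_yes, fit_no in table:
        yield obs_yes, fit_yes
        yield obs_no, fit_no

def pearson_x2(table):
    return sum((o - f) ** 2 / f for o, f in cells(table))

def deviance_g2(table):
    # A zero observed count contributes 0 to the sum.
    return 2.0 * sum(o * math.log(o / f) for o, f in cells(table) if o > 0)

def chi2_sf_even_df(t, df):
    """P(chi-squared with even df exceeds t), using the closed-form finite sum."""
    k = df // 2
    return math.exp(-t / 2.0) * sum((t / 2.0) ** j / math.factorial(j) for j in range(k))

x2, g2 = pearson_x2(rows), deviance_g2(rows)
df = len(rows) - 2   # eight binomial samples minus two model parameters
print(round(x2, 1), round(g2, 1), df)
print(chi2_sf_even_df(x2, df), chi2_sf_even_df(g2, df))
```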

As  the  number  of  explanatory  variables  increases,  this  strategy  loses  effectiveness. 
Simultaneous  grouping  of  values  for  each  variable  can  produce  a  contingency  table  with  a 
large  number  of  cells,  most  of  which  have  very  small  counts. 

Regardless of the number of explanatory variables, we can partition observed and fitted
values according to the estimated probabilities of success using the original ungrouped
data. One common approach forms the groups in the partition so they have approximately
equal size. With 10 groups, the first pair of observed counts and corresponding fitted counts
refers to the $n/10$ observations having the highest estimated probabilities, the next pair
refers to the $n/10$ observations having the second decile of estimated probabilities, and so
on. Each group has an observed count of subjects with each outcome and a fitted value for
each outcome. The fitted value for an outcome is the sum of the estimated probabilities for
that outcome for all observations in that group.

This construction is the basis of a test due to Hosmer and Lemeshow (1980). They
proposed a Pearson statistic comparing the observed and fitted counts for this partition. Let
$y_{ij}$ denote the binary outcome for observation j in group i of the partition, $i = 1, \ldots, g$,
$j = 1, \ldots, n_i$. Let $\hat{\pi}_{ij}$ denote the corresponding fitted probability for the model
fitted to the ungrouped data. Their statistic equals

$$\sum_{i=1}^{g} \frac{\left(\sum_j y_{ij} - \sum_j \hat{\pi}_{ij}\right)^2}{\left(\sum_j \hat{\pi}_{ij}\right)\left[1 - \left(\sum_j \hat{\pi}_{ij}\right)/n_i\right]}.$$

When many observations have the same estimated probability, there is some arbitrariness
in forming the groups, and different software may report somewhat different values. This
statistic does not have a limiting chi-squared distribution, because the observations in a
group do not share a common success probability and thus are not identical trials. However,
Hosmer and Lemeshow noted that when the number of distinct patterns of covariate values
equals the sample size, the null distribution is approximated by chi-squared with df = g - 2.

For the logistic regression fitted to the horseshoe crab data with continuous width
predictor, the Hosmer-Lemeshow statistic with g = 10 groups equals 3.5, with df = 8. It
also indicates a decent fit.
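
A direct implementation of the statistic is short. The following is a sketch, not the routine any particular software uses: the function name `hosmer_lemeshow`, the rule for cutting the sorted observations into groups, and the tiny data set are all assumptions for illustration. As the text notes, ties in the fitted probabilities make the grouping somewhat arbitrary, so other implementations may give different values.

```python
def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic for binary outcomes y and fitted probabilities p.

    Observations are sorted by fitted probability and cut into g groups of
    approximately equal size.
    """
    order = sorted(range(len(y)), key=lambda k: p[k])
    n = len(order)
    stat = 0.0
    for i in range(g):
        idx = order[(i * n) // g:((i + 1) * n) // g]
        n_i = len(idx)
        if n_i == 0:
            continue  # more groups than observations
        observed = sum(y[k] for k in idx)   # sum of y_ij in group i
        expected = sum(p[k] for k in idx)   # sum of fitted probabilities in group i
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / n_i))
    return stat

# Tiny hypothetical example with two groups of two observations each.
y = [0, 1, 1, 1]
p = [0.2, 0.2, 0.8, 0.8]
hl = hosmer_lemeshow(y, p, g=2)
print(hl)
```

In this example the first group contributes (1 - 0.4)^2 / [0.4(1 - 0.2)] and the second contributes (2 - 1.6)^2 / [1.6(1 - 0.8)], which can be checked by hand against the formula.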

In summary, the $X^2$ and $G^2$ goodness-of-fit tests work well when n is large relative to
the  number  of  distinct  covariate  patterns,  whereas  the  Hosmer-Lemeshow  test  works  well 
when  the  number  of  distinct  covariate  patterns  is  large.  Unfortunately,  like  other  proposed 
global  fit  statistics,  the  Hosmer-Lemeshow  statistic  does  not  have  good  power  for  detecting 
particular  types  of  lack  of  fit  (Hosmer  et  al.  1997).  One  example  is  when  the  correct  model 
has  an  interaction  between  a  binary  and  continuous  covariate  but  the  chosen  model  has  only 
the  continuous  covariate.  Tsiatis  (1980)  suggested  an  alternative  goodness-of-fit  test  that 
partitions  values  for  the  explanatory  variables  into  a  set  of  regions  and  adds  an  indicator 
variable  to  the  model  for  each  region.  The  test  statistic  compares  the  fit  of  this  model  to  the 
simpler  one,  testing  that  the  extra  parameters  are  not  needed.  Alternatively,  one  could  use 
a  bootstrap  method  to  evaluate  fit.  Azzalini  et  al.  (1989)  used  the  parametric  bootstrap  to 
evaluate  the  distance  between  the  logistic  model  fit  and  a  nonparametric  smoothing  of  the 
data  (to  be  introduced  in  Section  7.4.2);  the  bootstrap  simulations  estimated  the  proportion 
of  times  that  a  likelihood-ratio  form  of  statistic  is  larger  than  observed.  In  any  case,  a  large 
value  of  any  global  fit  statistic  merely  indicates  some  lack  of  fit  but  provides  no  insight 
about  its  nature.  The  approach  of  comparing  the  working  model  to  a  more  complex  one 
is  more  useful  from  a  scientific  perspective,  since  it  searches  for  lack  of  fit  of  a  particular 
type. 

For  any  approach  to  checking  fit,  when  the  fit  is  poor,  diagnostic  measures  describe 
the  influence  of  individual  observations  on  the  model  fit  and  highlight  reasons  for  the 
inadequacy. We discuss these in Section 6.2.1.

5.2.6  Wald  Inference  Can  Be  Suboptimal 

Wald,  likelihood-ratio,  and  score  methods  of  inference  usually  give  similar  results  for  large 
samples.  Each  method  of  inference  can  also  produce  small-sample  confidence  intervals  and 
tests.  We  defer  discussion  of  this  until  Sections  7.3,  16.5,  and  16.6. 

Although these methods usually give similar results, the Wald method has two disadvantages
compared with the likelihood-ratio and score methods. First, its results depend on the
scale for the parameterization. To illustrate, suppose that Y has a bin$(n, \pi)$ distribution. For
the model $\mathrm{logit}(\pi) = \alpha$, consider testing $H_0$: $\alpha = 0$ (i.e., $\pi = 0.50$).
From Section 3.1.6, the asymptotic variance of $\hat{\alpha} = \mathrm{logit}(\hat{\pi})$ (with
$\hat{\pi} = y/n$) is $[n\pi(1 - \pi)]^{-1}$. The Wald chi-squared test statistic is
$[\mathrm{logit}(\hat{\pi})]^2[n\hat{\pi}(1 - \hat{\pi})]$. On the proportion scale, the Wald statistic
is $(\hat{\pi} - 0.50)^2[n/\hat{\pi}(1 - \hat{\pi})]$. These are not the same. For example, when
$\hat{\pi}$ is near 0 or 1 (so $|\hat{\alpha}|$ is large), the ratio of the Wald statistic on the logit
scale to the Wald statistic on the proportion scale approaches 0 as n increases. Evaluations
reveal that the logit-scale statistic tends to be too conservative and the proportion-scale
statistic tends to be too liberal.

This behavior of the Wald statistic for the logit reflects another disadvantage. When a
true effect is relatively large, the Wald test is not as powerful as the likelihood-ratio and
score tests and can even show aberrant behavior (Hauck and Donner 1977). For the
single-binomial case just described, for example, suppose $n = 25$. We would regard $y = 24$ as
stronger evidence against $H_0$ than $y = 23$, yet the logit Wald statistic equals 9.7 when
$y = 24$ and 11.0 when $y = 23$. For comparison, the likelihood-ratio statistics are 26.3 and
20.7.
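
The figures in this example can be reproduced directly from the formulas above. The following Python sketch is only an illustration; the function names `wald_logit`, `wald_prop`, and `lr_stat` are labels chosen here, not notation from the text.

```python
import math

def wald_logit(y, n):
    """Wald chi-squared statistic for H0: logit(pi) = 0, on the logit scale."""
    p = y / n
    return math.log(p / (1 - p)) ** 2 * n * p * (1 - p)

def wald_prop(y, n, pi0=0.5):
    """Wald chi-squared statistic for H0: pi = pi0, on the proportion scale."""
    p = y / n
    return (p - pi0) ** 2 * n / (p * (1 - p))

def lr_stat(y, n, pi0=0.5):
    """Likelihood-ratio statistic for H0: pi = pi0 (requires 0 < y < n)."""
    return 2 * (y * math.log(y / (n * pi0))
                + (n - y) * math.log((n - y) / (n * (1 - pi0))))

for y in (23, 24):
    print(y, round(wald_logit(y, 25), 1), round(wald_prop(y, 25), 1),
          round(lr_stat(y, 25), 1))
```

Moving from $y = 23$ to $y = 24$ raises the likelihood-ratio statistic but lowers the logit-scale Wald statistic, which is the aberrant behavior described above.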

More generally, Hauck and Donner showed that for fixed sample size, the Wald statistic
for testing $H_0$: $\beta = 0$ in the logistic model eventually starts decreasing and actually
converges toward 0 as $\beta$ grows unboundedly. A similar result holds for logistic models with
multiple predictors.


5.3  LOGISTIC  MODELS  WITH  CATEGORICAL  PREDICTORS 

Like ordinary regression, logistic regression extends to include qualitative explanatory
variables, often called factors, as first noted by Dyke and Patterson (1952). We use indicator
variables to do this.

5.3.1  ANOVA-Type  Representation  of  Factors 

For simplicity, we first consider a single factor $X$, with $I$ categories. In row $i$ of the $I \times 2$
table, let $y_i$ be the number of outcomes in the first column (successes) out of $n_i$ trials. We
treat $y_i$ as binomial with parameter $\pi_i$.

The logistic regression model with a single factor as a predictor is

$$\log \frac{\pi_i}{1 - \pi_i} = \alpha + \beta_i. \tag{5.4}$$

The higher $\beta_i$ is, the higher the value of $\pi_i$. The right-hand side of (5.4) resembles the
model formula for means in one-way ANOVA.

As in ANOVA, the factor has as many parameters $\{\beta_i\}$ as categories. Unless we delete
$\alpha$ from the model, one $\beta_i$ is redundant. One $\beta_i$ can be set to 0, say, $\beta_I = 0$ for the last
category. If the values do not satisfy this, we can recode so that it is true. For instance, set
$\tilde\beta_i = \beta_i - \beta_I$ and $\tilde\alpha = \alpha + \beta_I$, which satisfy $\tilde\beta_I = 0$. Then

$$\text{logit}(\pi_i) = \alpha + \beta_i = (\alpha + \beta_I) + (\beta_i - \beta_I) = \tilde\alpha + \tilde\beta_i,$$

where the newly defined parameters satisfy the constraint. When $\beta_I = 0$, $\alpha$ equals the logit
in row $I$, and $\beta_i$ is the difference between the logits in rows $i$ and $I$. Thus, $\beta_i$ equals the log
odds ratio for that pair of rows.

For any $\{\pi_i > 0\}$, $\{\beta_i\}$ exist such that model (5.4) holds. The model has as many
parameters ($I$) as binomial observations and is saturated. When a factor has no effect,
$\beta_1 = \beta_2 = \cdots = \beta_I$. Since this is equivalent to $\pi_1 = \cdots = \pi_I$, this case corresponds to
statistical independence of $X$ and $Y$.

5.3.2  Indicator  Variables  Represent  a  Factor 

An equivalent expression of model (5.4) uses indicator variables. Let $x_i = 1$ for observations
in row $i$ and $x_i = 0$ otherwise, $i = 1, \ldots, I - 1$. The model is

$$\text{logit}(\pi_i) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{I-1} x_{I-1}.$$

This accounts for parameter redundancy by not forming an indicator variable for category $I$.
The constraint $\beta_I = 0$ corresponds to this choice of indicator variables. The category to
exclude for an indicator variable is arbitrary. Some software sets $\beta_1 = 0$; this corresponds
to a model with indicator variables for categories 2 through $I$, but not category 1.

Another way to impose constraints sets $\sum_i \beta_i = 0$. When $X$ has $I = 2$ categories, then
$\beta_1 = -\beta_2$. This results from effect coding for an indicator variable, $x = 1$ in category 1
and $x = -1$ in category 2.

The same substantive results about estimable effects occur for any coding scheme. For
model (5.4), regardless of the constraint for $\{\beta_i\}$, the linear predictor values $\{\hat\alpha + \hat\beta_i\}$ and
hence $\{\hat\pi_i\}$ are the same. The differences $\hat\beta_a - \hat\beta_b$ for pairs $(a, b)$ of categories of $X$ are
identical and represent estimated log odds ratios. Thus, $\exp(\hat\beta_a - \hat\beta_b)$ is the estimated odds
of success in category $a$ of $X$ divided by the estimated odds of success in category $b$ of $X$.
Reparameterizing a model may change parameter estimates but does not change the model
fit or the effects of interest.

The value $\beta_i$ or $\hat\beta_i$ for a single category is irrelevant. Different constraint systems
result in different values. For a binary predictor, for instance, using indicator variables
with reference value $\beta_2 = 0$, the log odds ratio equals $\beta_1 - \beta_2 = \beta_1$; by contrast, for
effect coding with a $\pm 1$ indicator variable and hence $\beta_1 + \beta_2 = 0$, the log odds ratio equals
$\beta_1 - \beta_2 = \beta_1 - (-\beta_1) = 2\beta_1$. A parameter or its estimate makes sense only by comparison
with one for another category.


5.3.3  Example:  Alcohol  and  Infant  Malformation  Revisited 

We return now to Table 3.8 from the study of maternal alcohol consumption and child's
congenital malformations, shown again in Table 5.3. For model (5.4), we treat malformation
status as the response and alcohol consumption as an explanatory factor. Regardless of the
constraint for $\{\beta_i\}$, the model is saturated and $\{\hat\alpha + \hat\beta_i\}$ are the sample logits, reported in
Table 5.3. For instance,

$$\text{logit}(\hat\pi_1) = \hat\alpha + \hat\beta_1 = \log(48/17{,}066) = -5.87.$$

For the coding that constrains $\beta_5 = 0$, $\hat\alpha = -3.61$ and $\hat\beta_1 = -2.26$. For the coding $\beta_1 = 0$,
$\hat\alpha = -5.87$. Table 5.3 shows that except for the slight reversal between the first and second
categories of alcohol consumption, the sample logits and hence the sample proportions of
malformation cases increase as alcohol consumption increases.
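
The invariance of the fit to the choice of constraint can be checked numerically for the saturated model. The sketch below is illustrative; the helper `constrain` and its scheme labels (`"last"`, `"first"`, `"sum"`) are names invented here. It takes the counts of Table 5.3, forms the sample logits, and expresses them as $\alpha + \beta_i$ under three constraint systems.

```python
import math

# Counts from Table 5.3 (malformation present / absent by alcohol level).
present = [48, 38, 5, 1, 1]
absent = [17066, 14464, 788, 126, 37]
logits = [math.log(y / m) for y, m in zip(present, absent)]

def constrain(logits, scheme):
    """Return (alpha, betas) for the saturated model under a constraint."""
    if scheme == "last":        # beta_I = 0
        alpha = logits[-1]
    elif scheme == "first":     # beta_1 = 0
        alpha = logits[0]
    elif scheme == "sum":       # sum of beta_i = 0
        alpha = sum(logits) / len(logits)
    else:
        raise ValueError(scheme)
    return alpha, [l - alpha for l in logits]

for scheme in ("last", "first", "sum"):
    alpha, betas = constrain(logits, scheme)
    # alpha + beta_i and beta_1 - beta_5 do not depend on the scheme.
    print(scheme, round(alpha, 2), round(betas[0], 2),
          round(betas[0] - betas[-1], 2))
```

Only $\alpha$ and the individual $\beta_i$ change across schemes; the sums $\alpha + \beta_i$ and the differences $\beta_a - \beta_b$ do not.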


Table 5.3  Sample Logits and Proportion of Malformation for Table 3.8,
with Fitted Proportions for Linear Logit Model

Alcohol           Malformation       Sample    Proportion Malformed
Consumption     Present    Absent    Logit     Observed    Fitted
0                  48      17,066    -5.87      0.0028     0.0026
<1                 38      14,464    -5.94      0.0026     0.0030
1-2                 5         788    -5.06      0.0063     0.0041
3-5                 1         126    -4.84      0.0079     0.0091
≥6                  1          37    -3.61      0.0263     0.0231

The simpler model with all $\beta_i = 0$ specifies independence. For it, $\hat\alpha$ equals the logit
for the overall sample proportion of malformations, which is $\log(93/32{,}481) = -5.86$.
To test $H_0$: independence (df = 4), the Pearson statistic (3.10) is $X^2 = 12.1$ ($P = 0.02$),
and the likelihood-ratio statistic (3.11) is $G^2 = 6.2$ ($P = 0.19$). These provide mixed
signals. Table 5.3 has a mixture of very small, moderate, and extremely large counts. Even
though $n = 32{,}574$, the null sampling distributions of $X^2$ or $G^2$ may not be close to
chi-squared. The P-values using the exact conditional distributions of $X^2$ and $G^2$
(Section 16.5.2) are 0.03 and 0.13. These are closer, but still give differing evidence. In any
case, these statistics ignore the ordinality of alcohol consumption. The sample suggests
that malformations may tend to be more likely with higher alcohol consumption. The first
two proportions are similar and the next two are also similar, however, and either of the last
two proportions changes substantially with the addition or deletion of one malformation
case.
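
As a check on these values, the two chi-squared statistics for independence can be computed from the counts in Table 5.3. The Python sketch below is an illustration under the usual large-sample chi-squared approximation (which, as noted above, may be poor for this table); the helper names are chosen here, and the closed-form tail probability applies only to df = 4.

```python
import math

present = [48, 38, 5, 1, 1]
absent = [17066, 14464, 788, 126, 37]

def independence_stats(col1, col2):
    """Return (Pearson X^2, likelihood-ratio G^2) for an I x 2 table."""
    n1, n2 = sum(col1), sum(col2)
    n = n1 + n2
    x2 = g2 = 0.0
    for a, b in zip(col1, col2):
        row_total = a + b
        for observed, col_total in ((a, n1), (b, n2)):
            expected = row_total * col_total / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
    return x2, g2

def chi2_sf_df4(x):
    """Right-tail probability of a chi-squared variate with df = 4."""
    return math.exp(-x / 2) * (1 + x / 2)

x2, g2 = independence_stats(present, absent)
print(round(x2, 1), round(chi2_sf_df4(x2), 2))
print(round(g2, 1), round(chi2_sf_df4(g2), 2))
```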


5.3.4  Linear  Logit  Model  for  /  x  2  Contingency  Tables 

Model (5.4) is invariant to the ordering of categories, so it treats the explanatory factor
as nominal. For ordered factor categories, other models are more parsimonious, yet more
complex than the independence model. For instance, let $(x_1, x_2, \ldots, x_I)$ be scores that
describe distances between categories of $X$. When we expect a monotone effect of $X$ on $Y$,
it is natural to fit the linear logit model

$$\text{logit}(\pi_i) = \alpha + \beta x_i. \tag{5.5}$$

The independence model is the special case $\beta = 0$.

The near-monotone increase in the sample logits in Table 5.3 indicates that the linear
logit model may fit better than the independence model. As measured, alcohol consumption
groups a naturally continuous variable. We use scores ($x_1 = 0$, $x_2 = 0.5$, $x_3 = 1.5$,
$x_4 = 4.0$, $x_5 = 7.0$), the last score being somewhat arbitrary. Table 5.4 shows results. The
estimated multiplicative effect of a unit increase in daily alcohol consumption on the odds
of malformation is $\exp(0.317) = 1.37$. Table 5.3 shows the observed and fitted proportions
of malformation. The model seems to fit well, as statistics comparing observed and fitted
counts are $G^2 = 1.95$ and $X^2 = 2.05$, with df = 3.
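
The estimates reported in Table 5.4 come from SAS. As a rough cross-check, the linear logit model can be fitted by Newton-Raphson in a few lines of Python. This is a minimal sketch, not the algorithm SAS uses internally; the starting values (the independence fit) and the function name `score_and_info` are choices made here.

```python
import math

scores = [0.0, 0.5, 1.5, 4.0, 7.0]
present = [48, 38, 5, 1, 1]               # malformation present
totals = [17114, 14502, 793, 127, 38]     # present + absent, from Table 5.3

def score_and_info(a, b):
    """Score vector and information matrix entries for logit(pi_i) = a + b*x_i."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, y, n in zip(scores, present, totals):
        pi = 1 / (1 + math.exp(-(a + b * x)))
        resid = y - n * pi
        w = n * pi * (1 - pi)
        u0 += resid
        u1 += x * resid
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return u0, u1, i00, i01, i11

# Start from the independence model: common logit, zero slope.
a = math.log(sum(present) / (sum(totals) - sum(present)))
b = 0.0
for _ in range(50):
    u0, u1, i00, i01, i11 = score_and_info(a, b)
    det = i00 * i11 - i01 ** 2
    da = (i11 * u0 - i01 * u1) / det
    db = (i00 * u1 - i01 * u0) / det
    a, b = a + da, b + db
    if abs(da) + abs(db) < 1e-10:
        break

_, _, i00, i01, i11 = score_and_info(a, b)
se_b = math.sqrt(i00 / (i00 * i11 - i01 ** 2))   # Wald standard error of slope

# Deviance: compares observed counts with fitted counts n_i * pi_i.
deviance = 0.0
for x, y, n in zip(scores, present, totals):
    mu = n / (1 + math.exp(-(a + b * x)))
    deviance += 2 * (y * math.log(y / mu) + (n - y) * math.log((n - y) / (n - mu)))

print(round(a, 4), round(b, 4), round(se_b, 4), round(deviance, 4))
```

The printed values can be compared with the intercept, slope, standard error, and deviance in Table 5.4; $\exp(b)$ gives the estimated odds multiplier per unit increase in the score.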


Table 5.4  Software Output (Based on SAS) for Linear Logit Model Fitted to
Table 5.3 on Infant Malformation and Alcohol Consumption

Criteria For Assessing Goodness Of Fit
Criterion             DF        Value
Deviance               3       1.9487
Pearson Chi-Square     3       2.0523
Log Likelihood              -635.5968

                        Std     Likelihood-Ratio       Wald
Parameter   Estimate    Error   95% Conf Limits       Chi-Sq    Pr>ChiSq
Intercept   -5.9605     0.1154  -6.1930   -5.7397     2666.41    <.0001
alcohol      0.3166     0.1254   0.0187    0.5236        6.37    0.0116


5.3.5  Cochran-Armitage  Trend  Test 

Armitage (1955) and Cochran (1954) were among the first to emphasize the importance of
utilizing ordered categories in a contingency table. For $I \times 2$ tables with ordered rows and $I$
independent bin($n_i$, $\pi_i$) variates $\{y_i\}$, they proposed a trend statistic for testing independence
by partitioning the Pearson statistic for that hypothesis. They used a linear probability model,

$$\pi_i = \alpha + \beta x_i, \tag{5.6}$$

fitted by ordinary least squares. The null hypothesis of independence is $H_0$: $\beta = 0$. Let
$\bar{x} = \sum_i n_i x_i / n$. Let $p_i = y_i / n_i$, and let $p = (\sum_i y_i)/n$ denote the overall proportion of
successes. The prediction equation is

$$\hat\pi_i = p + b(x_i - \bar{x}),$$

where

$$b = \frac{\sum_i n_i (p_i - p)(x_i - \bar{x})}{\sum_i n_i (x_i - \bar{x})^2}.$$


Denote the Pearson statistic for testing independence by $X^2(I)$. We express $X^2(I)$ in
terms of variation among the $I$ sample proportions by

$$X^2(I) = \frac{1}{p(1 - p)} \sum_i n_i (p_i - p)^2.$$

Reported by Fisher (1934) and attributed to A. E. Brandt and G. W. Snedecor, this is referred
to as the Brandt-Snedecor formula. It generalizes the equality in $2 \times 2$ tables between $X^2$
and the square of the pooled two-sample z-statistic (3.12). Cochran (1954) noted that this
Pearson formula decomposes into

$$X^2(I) = z^2 + X^2(L),$$

where

$$X^2(L) = \frac{1}{p(1 - p)} \sum_i n_i (p_i - \hat\pi_i)^2,$$

$$z^2 = \frac{b^2}{p(1 - p)} \sum_i n_i (x_i - \bar{x})^2 = \left[ \frac{\sum_i (x_i - \bar{x}) y_i}{\sqrt{p(1 - p) \sum_i n_i (x_i - \bar{x})^2}} \right]^2. \tag{5.7}$$


When the linear probability model holds, $X^2(L)$ is asymptotically chi-squared with df =
$I - 2$. It tests the fit of the model. The statistic $z^2$, with df = 1, tests $H_0$: $\beta = 0$ for the
linear trend (5.6) in the proportions. The test of independence using this statistic is called
the Cochran-Armitage trend test.

This statistic relates to the correlation-based statistic $M^2$ introduced in (3.16) in
Section 3.4.1 to test for a linear trend in an $I \times J$ table; namely, $z^2 = n r^2 = [n/(n - 1)] M^2$.
See Yates (1948) and Mantel (1963). When $I = 2$, then $X^2(L) = 0$ and $z^2 = X^2(I)$.

The Cochran-Armitage trend test seems unrelated to the linear logit model. However,
this test statistic is equivalent to the score statistic for testing $H_0$: $\beta = 0$ in that model. In
fact, Tarone and Gart (1980) showed that the score test for a binary linear trend model does
not depend on the link function. Thus, this trend test is locally asymptotically efficient for
both linear and logistic alternatives for $P(Y = 1)$. See Cox (1958a) for related remarks.
Gross (1981) showed that when the linear logit model holds but we use an incorrect set of
scores, the local asymptotic relative efficiency for testing independence using the statistic
with those scores equals the square of the Pearson correlation between the true and the
incorrect scores.

5.3.6  Example:  Alcohol  and  Infant  Malformation  Revisited 

For Table 5.3 on alcohol consumption and infant malformation, $X^2(I) = 12.08$. Using the
scores (0, 0.5, 1.5, 4.0, 7.0) as in the linear logit model, the Cochran-Armitage trend test
has $z^2 = 6.57$ (P-value = 0.010). The test suggests strong evidence of a positive slope. In
addition,

$$X^2(I) = 12.08 = 6.57 + 5.51,$$

where $X^2(L) = 5.51$ (df = 3) shows only slight evidence of departure of the proportions
from linearity. The trend test result is nearly identical to the test using $M^2 = (n - 1) r^2$
based on the sample correlation of $r = 0.0142$ for $n = 32{,}574$. For the chosen scores, the
correlation seems weak. However, $r$ has limited use as a descriptive measure for tables that
are highly discrete and unbalanced.
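
The decomposition $X^2(I) = z^2 + X^2(L)$ for this example can be reproduced from the formulas of Section 5.3.5. The Python sketch below is illustrative; the variable names are chosen here, and the P-value uses the identity that a chi-squared tail probability with df = 1 equals `erfc(sqrt(x/2))`.

```python
import math

scores = [0.0, 0.5, 1.5, 4.0, 7.0]
present = [48, 38, 5, 1, 1]
totals = [17114, 14502, 793, 127, 38]   # row totals from Table 5.3

n = sum(totals)
p = sum(present) / n                    # overall proportion of successes
xbar = sum(ni * xi for ni, xi in zip(totals, scores)) / n
sxx = sum(ni * (xi - xbar) ** 2 for ni, xi in zip(totals, scores))

# Least-squares slope of the linear probability model (5.6).
b = sum(ni * (yi / ni - p) * (xi - xbar)
        for ni, yi, xi in zip(totals, present, scores)) / sxx

x2_I = sum(ni * (yi / ni - p) ** 2
           for ni, yi in zip(totals, present)) / (p * (1 - p))
z2 = b ** 2 * sxx / (p * (1 - p))       # Cochran-Armitage trend statistic
x2_L = sum(ni * (yi / ni - (p + b * (xi - xbar))) ** 2
           for ni, yi, xi in zip(totals, present, scores)) / (p * (1 - p))

p_trend = math.erfc(math.sqrt(z2 / 2))  # chi-squared tail, df = 1
r = math.sqrt(z2 / n)                   # since z^2 = n r^2

print(round(x2_I, 2), round(z2, 2), round(x2_L, 2),
      round(p_trend, 3), round(r, 4))
```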

The Cochran-Armitage trend test (i.e., the score test) usually gives results similar to
the Wald or likelihood-ratio test of $H_0$: $\beta = 0$ in the linear logit model. The asymptotics
work well even for quite small $n$ when $\{n_i\}$ are equal and $\{x_i\}$ are equally spaced. With
Table 5.3, the Wald statistic equals $(\hat\beta/SE)^2 = (0.3166/0.1254)^2 = 6.37$ ($P = 0.012$) and
the likelihood-ratio statistic equals 4.25 ($P = 0.039$). Here, however, the highly unbalanced
counts suggest that it is best not to use the Wald approach for testing or for interval
estimation. The profile likelihood 95% confidence interval of (0.02, 0.52) for $\beta$ reported
in Table 5.4 is preferable to the Wald interval of $0.317 \pm 1.96(0.125) = (0.07, 0.56)$. The
sample size in the last row is relatively small, and the single "present" observation in
that row is highly influential. P-values depend dramatically on whether that observation is
included in the analysis (Exercise 5.10).


5.3.7  Using  Directed  Models  Can  Improve  Inferential  Power 

When  contingency  tables  have  ordered  categories,  in  Section  3.4  we  showed  that  tests  that 
utilize  the  ordering  can  have  improved  power.  Testing  independence  against  a  linear  trend 
alternative  in  a  linear  logit  model  is  a  way  to  do  this.  In  this  section  we  present  the  reason 
for  these  power  improvements. 

In an $I \times 2$ contingency table for $I$ binomial variates with parameters $\{\pi_i\}$, $H_0$:
independence states logit($\pi_i$) = $\alpha$. The ordinary $G^2$ and $X^2$ statistics of Section 3.2.1 refer to
the general alternative,

$$\text{logit}(\pi_i) = \alpha + \beta_i,$$

which is saturated. They test $H_0$: $\beta_1 = \cdots = \beta_I$ in that model, with df = $(I - 1)$.
Their general alternative treats both classifications as nominal. Denote these test statistics
as $G^2(I)$ and $X^2(I)$. Note that $G^2(I)$ is the likelihood-ratio statistic
$G^2(M_0 \mid M_1) = -2(L_0 - L_1)$ for comparing the saturated model $M_1$ with the
independence ($I$) model $M_0$.

Ordinal test statistics refer to narrower, often more relevant, alternatives. With ordered
rows, an example is a test of $H_0$: $\beta = 0$ in the linear logit model, logit($\pi_i$) = $\alpha + \beta x_i$. The
likelihood-ratio statistic $G^2(I \mid L) = G^2(I) - G^2(L)$ compares the linear logit model and
the independence model. When a test statistic focuses on a single parameter, such as $\beta$ in
that model, it has df = 1. Now, df equals the mean of the chi-squared distribution. A large
test statistic with df = 1 falls farther out in its right-hand tail than a comparable value of
$X^2(I)$ or $G^2(I)$ with df = $(I - 1)$. Thus, it has a smaller P-value.

5.3.8  Noncentral  Chi-Squared  Distribution  and  Power  for  Narrower  Alternatives 

To compare power of $G^2(I \mid L)$ and $G^2(I)$, it is necessary to compare their nonnull
sampling distributions. When $H_0$ is false, their distributions are approximately noncentral
chi-squared. This distribution, introduced by R. A. Fisher in 1928, arises from the
following construction: If $Z_i \sim N(\mu_i, 1)$, $i = 1, \ldots, \nu$, and if $Z_1, \ldots, Z_\nu$ are independent,
$\sum_i Z_i^2$ has the noncentral chi-squared distribution with df = $\nu$ and noncentrality
parameter $\lambda = \sum_i \mu_i^2$. Its mean is $\nu + \lambda$ and its variance is $2(\nu + 2\lambda)$. The ordinary (central)
chi-squared distribution, which occurs when $H_0$ is true, has $\lambda = 0$.

Let $X^2_{\nu,\lambda}$ denote a noncentral chi-squared random variable with df = $\nu$ and
noncentrality $\lambda$. A fundamental result for chi-squared analyses is that, for fixed $\lambda$,

$$P[X^2_{\nu,\lambda} > \chi^2_\nu(\alpha)] \text{ increases as } \nu \text{ decreases}.$$

That is, the power for rejecting $H_0$ at a fixed $\alpha$-level increases as the df of the test decreases
(Das Gupta and Perlman 1974). For fixed $\nu$, the power equals $\alpha$ when $\lambda = 0$, and it
increases as $\lambda$ increases. The inverse relation between power and df suggests that focusing
the noncentrality on a statistic having a small df value can improve power.
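
The construction of the noncentral chi-squared distribution lends itself to a quick simulation check of both its mean and the inverse relation between power and df. The following Python sketch is illustrative only: the function name, the seed, the number of replications, and the choice $\lambda = 6$ are arbitrary assumptions, and the 0.05-level critical values 3.841 (df = 1) and 14.067 (df = 7) are standard table values.

```python
import random

def simulate_power(df, lam, crit, reps=20000, seed=1):
    """Monte Carlo power and mean of a noncentral chi-squared(df, lam) variate.

    The whole noncentrality is placed on the first normal component,
    which is allowed because the distribution depends on the means
    only through lam = sum of mu_i squared.
    """
    rng = random.Random(seed)
    mu = lam ** 0.5
    total = 0.0
    rejections = 0
    for _ in range(reps):
        s = rng.gauss(mu, 1) ** 2
        s += sum(rng.gauss(0, 1) ** 2 for _ in range(df - 1))
        total += s
        rejections += s > crit
    return rejections / reps, total / reps

power1, mean1 = simulate_power(df=1, lam=6.0, crit=3.841)
power7, mean7 = simulate_power(df=7, lam=6.0, crit=14.067)
print(power1, mean1)   # mean should be near 1 + 6
print(power7, mean7)   # mean should be near 7 + 6
```

With the same noncentrality, the simulated rejection rate for df = 1 should clearly exceed that for df = 7.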

Suppose that an explanatory variable has, at least approximately, a linear effect on
logit[$P(Y = 1)$]. To test independence with reasonable power, it is then sensible to use a
statistic based on the linear logit model, using the likelihood-ratio statistic $G^2(I \mid L)$, the
Wald statistic $z = \hat\beta/SE$, and the Cochran-Armitage (score) statistic. When is $G^2(I \mid L)$
more powerful than $G^2(I)$? The statistics satisfy

$$G^2(I) = G^2(I \mid L) + G^2(L),$$

where $G^2(L)$ tests goodness of fit of the linear logit model. When the linear logit model
holds, $G^2(L)$ has an asymptotic chi-squared distribution with df = $I - 2$; then if $\beta \neq 0$,
$G^2(I)$ and $G^2(I \mid L)$ both have approximate noncentral chi-squared distributions with the
same noncentrality. Whereas df = $I - 1$ for $G^2(I)$, df = 1 for $G^2(I \mid L)$. Thus, $G^2(I \mid L)$ is
more powerful, because it uses fewer degrees of freedom.

When the linear logit model does not hold, $G^2(I)$ has greater noncentrality than $G^2(I \mid L)$,
the discrepancy increasing as the model fits more poorly. However, when the model
approximates reality fairly well, usually $G^2(I \mid L)$ is still more powerful. That test's df value of
1 more than compensates for its loss in noncentrality. The closer the true relationship is to
the linear logit, the more nearly $G^2(I \mid L)$ captures the same noncentrality as $G^2(I)$, and the
more powerful it is compared with $G^2(I)$. To illustrate, Figure 5.5 plots power as a function
of noncentrality when df = 1 and 7. When the noncentrality of a test having df = 1 is at
least about half that of a test having df = 7, the test with df = 1 is more powerful. The
linear logit model then helps detect a key component of an association. As Mantel (1963)
argued in a similar context, "that a linear regression is being tested does not mean that
an assumption of linearity is being made. Rather it is that test of a linear component of
regression provides power for detecting any progressive association which may exist."

The improved power for the linear trend statistic results from sacrificing power in other
cases. The $G^2(I)$ test can have greater power than $G^2(I \mid L)$ when the linear logit model
describes the true relationship very poorly.


5.3.9  Example:  Skin  Damage  and  Leprosy 

Table  5.5  refers  to  an  experiment  on  the  use  of  sulfones  and  streptomycin  drugs  in  the 
treatment  of  leprosy.  The  degree  of  infiltration  at  the  start  of  the  experiment  measures  a 
type  of  skin  damage.  The  response  is  the  change  in  the  overall  clinical  condition  of  the 
patient after 48 weeks of treatment. We use response scores (-1, 0, 1, 2, 3). The question
of  interest  is  whether  subjects  with  high  infiltration  changed  differently  from  those  with 
low  infiltration. 


Table 5.5  Change in Clinical Condition by Degree of Infiltration

                          Degree of Infiltration    Proportion
Clinical Change              High        Low           High
Worse                          1          11           0.08
Stationary                    13          53           0.20
Slight improvement            16          42           0.28
Moderate improvement          15          27           0.36
Marked improvement             7          11           0.39

Source: Reprinted with permission from the Biometric Society (Cochran 1954).

The test $G^2(I) = 7.28$ (df = 4) does not show much evidence of association ($P = 0.12$),
but it ignores that the clinical change response variable is ordinal. It seems natural to
compare the mean change for the two infiltration levels. Cochran (1954) and Yates (1948)
noted that this analysis is identical to a trend test treating the binary variable as the response.
In fact, the sample proportion of high infiltration increases monotonically as the clinical
change improves. The test of $H_0$: $\beta = 0$ in the linear logit model has $G^2(I \mid L) = 6.65$, with
df = 1 ($P = 0.01$). It gives strong evidence of more positive clinical change at the higher
level of infiltration. Using the ordering by decreasing df from 4 to 1 pays a strong dividend.
In addition, $G^2(L) = 0.63$ with df = 3 suggests that the linear trend model fits well.
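
The contrast between the two tests is easy to verify numerically. The Python sketch below recomputes $G^2(I)$ from the counts in Table 5.5 and converts the reported statistics to P-values. It is illustrative only: `chi2_sf` is a helper written here from the standard closed-form expressions for the chi-squared right-tail probability at integer df, and the values 6.65 and 0.63 are taken from the text rather than recomputed.

```python
import math

high = [1, 13, 16, 15, 7]
low = [11, 53, 42, 27, 11]

def g2_independence(col1, col2):
    """Likelihood-ratio statistic for independence in an I x 2 table."""
    n1, n2 = sum(col1), sum(col2)
    n = n1 + n2
    g2 = 0.0
    for a, b in zip(col1, col2):
        for observed, col_total in ((a, n1), (b, n2)):
            expected = (a + b) * col_total / n
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
    return g2

def chi2_sf(x, df):
    """Right-tail probability of a central chi-squared variate, integer df."""
    h = x / 2
    if df % 2 == 0:
        term = total = 1.0
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return math.exp(-h) * total
    total, term = 0.0, 1.0
    for j in range(df // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return (math.erfc(math.sqrt(h))
            + math.sqrt(2 * x / math.pi) * math.exp(-h) * total)

g2_I = g2_independence(high, low)
print(round(g2_I, 2), round(chi2_sf(g2_I, 4), 2))   # general alternative, df = 4
print(round(chi2_sf(6.65, 1), 3))                   # G2(I|L) from the text, df = 1
print(round(chi2_sf(0.63, 3), 2))                   # G2(L) from the text, df = 3
```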

5.3.10  Model  Smoothing  Improves  Precision  of  Estimation 

Using directed alternatives can improve not only test power, but also estimation of cell
probabilities and summary measures. In generic form, let $\pi$ be true cell probabilities in
a contingency table, let $p$ denote sample proportions, and let $\hat\pi$ denote model-based ML
estimates of $\pi$.

When $\pi$ satisfies a certain model, both $\hat\pi$ for that model and $p$ are consistent estimators
of $\pi$. The model-based estimator $\hat\pi$ is better, as its true asymptotic standard error cannot
exceed that of $p$. This happens because of model parsimony: The unsaturated model, on
which $\hat\pi$ is based, has fewer parameters than the saturated model, on which $p$ is based. In
fact, model-based estimators are also more efficient in estimating functions $g(\pi)$ of cell
probabilities. For any differentiable function $g$,

$$\text{asymp. var}[\sqrt{n}\, g(\hat\pi)] \le \text{asymp. var}[\sqrt{n}\, g(p)].$$

In Section 16.2.3 we show formulas. The result holds more generally than for categorical
data models (Altham 1984), a reason that statisticians prefer parsimonious models.

In reality, of course, a chosen model is unlikely to hold exactly. However, when the
model approximates $\pi$ well, unless $n$ is extremely large, $\hat\pi$ is still better than $p$. Although
$\hat\pi_i$ is biased, it has smaller variance than $p_i$, and $\text{MSE}(\hat\pi_i) < \text{MSE}(p_i)$ when its variance
plus squared bias is smaller than $\text{var}(p_i)$. In Section 3.3.8, for example, we showed that
independence-model estimates of cell probabilities in two-way tables can be much better
than sample proportions even when that model does not hold.
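
A small simulation can make the variance reduction concrete. The Python sketch below is an illustration under assumptions chosen here (a $2 \times 2$ table, $n = 50$, equal cell probabilities so that independence holds exactly, an arbitrary seed); the function name `simulate_mse` is hypothetical. It compares the mean squared error of the sample proportion in the first cell with that of the independence-model estimate, the product of the row and column sample proportions.

```python
import random

def simulate_mse(probs, n, reps=4000, seed=2):
    """Estimate MSE for cell (1,1) of a 2 x 2 table.

    probs lists the true cell probabilities [p11, p12, p21, p22].
    Returns (MSE of the sample proportion, MSE of the independence fit).
    """
    rng = random.Random(seed)
    cells = range(4)
    se_sample = se_model = 0.0
    for _ in range(reps):
        counts = [0] * 4
        for c in rng.choices(cells, weights=probs, k=n):
            counts[c] += 1
        p11 = counts[0] / n
        row1 = (counts[0] + counts[1]) / n
        col1 = (counts[0] + counts[2]) / n
        se_sample += (p11 - probs[0]) ** 2
        se_model += (row1 * col1 - probs[0]) ** 2
    return se_sample / reps, se_model / reps

mse_sample, mse_model = simulate_mse([0.25, 0.25, 0.25, 0.25], n=50)
print(mse_sample, mse_model)
```

Passing probabilities that depart mildly from independence, such as `[0.28, 0.22, 0.22, 0.28]`, allows one to explore the trade-off between the bias of the model-based estimate and its smaller variance.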


5.4  MULTIPLE  LOGISTIC  REGRESSION 

Like ordinary regression, logistic regression extends to models with multiple explanatory
variables, which can be a mixture of quantitative and qualitative (Cox 1958). The model
for $\pi(x) = P(Y = 1)$ at values $x = (x_1, \ldots, x_p)$ of $p$ predictors is

$$\text{logit}[\pi(x)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p. \tag{5.8}$$

The alternative formula, directly specifying $\pi(x)$, is

$$\pi(x) = \frac{\exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}{1 + \exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}. \tag{5.9}$$

For a qualitative predictor, we use indicator variables for its categories.

The parameter $\beta_j$ refers to the effect of $x_j$ on the log odds that $Y = 1$, adjusting for the
other $x_k$. For instance, $\exp(\beta_j)$ is the multiplicative effect on the odds of a 1-unit increase
in $x_j$, when we can keep fixed the levels of other $x_k$.


5.4.1  Logistic  Models  for  Multiway  Contingency  Tables 

When all variables are categorical, a multiway contingency table displays the data. We
illustrate ideas with binary predictors $X$ and $Z$. We treat the sample size at given
combinations of $X$ and $Z$ as fixed and regard the two counts on $Y$ at each setting as binomial, with
different binomials treated as independent. We let indicator variables $x$ and $z$ take value 1
in the first category and 0 in the second. The model

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z \tag{5.10}$$

has main effects for $X$ and $Z$ but assumes an absence of interaction. The effect of one factor
is the same at each level of the other.

At a fixed level $z$ of $Z$, the effect on the logit of changing categories of $X$ is

$$[\alpha + \beta_1(1) + \beta_2 z] - [\alpha + \beta_1(0) + \beta_2 z] = \beta_1. \tag{5.11}$$

This logit difference equals the difference of log odds, which is the log odds ratio between
$X$ and $Y$, fixing $Z$. Thus, $\exp(\beta_1)$ is the conditional odds ratio between $X$ and $Y$. Adjusting
for $Z$, the odds of success when $x = 1$ equal $\exp(\beta_1)$ times the odds when $x = 0$. This
conditional odds ratio is the same at each level of $z$; that is, there is homogeneous $XY$
association (Section 2.3.5). The lack of an interaction term implies a common odds ratio
for the partial tables. When $\beta_1 = 0$, that common odds ratio equals 1. Then $X$ and $Y$ are
independent in each partial table, or conditionally independent, given $Z$ (Section 2.3.4).

Additivity  on  the  logit  scale  is  the  usual  definition  of  no  interaction  for  categorical 
variables.  However,  it  could  instead  be  defined  as  additivity  on  some  other  scale,  such 
as  with  probit  or  identity  link.  Interaction  can  occur  on  one  scale  when  there  is  none  on 
another  scale.  In  some  applications,  a  particular  definition  may  be  natural.  For  instance, 
theory  might  assume  an  underlying  normal  distribution  for  Y  and  predict  that  the  probit  is 
an  additive  function  of  predictor  effects. 

A factor with $I$ categories needs $I - 1$ indicator variables. With $I$ categories for $X$ and
$K$ categories for $Z$, model (5.10) extends to

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1^X x_1 + \cdots + \beta_{I-1}^X x_{I-1} + \beta_1^Z z_1 + \cdots + \beta_{K-1}^Z z_{K-1},$$

where, for example, $z_k = 1$ for observations in category $k$ of $Z$ and $z_k = 0$ otherwise,
$k = 1, \ldots, K - 1$. This equation represents effects of $X$ with parameters $\{\beta_i^X\}$ and effects
of $Z$ with parameters $\{\beta_k^Z\}$. The $X$ and $Z$ superscripts are merely labels and do not represent
powers. This model form applies for any number of categories for $X$ and $Z$. The parameter
$\beta_k^Z$, for example, denotes the effect on the logit of classification in category $k$ of $Z$ instead
of category $K$.

An alternative representation of such factors resembles the way that ANOVA factorial
models often express them. The equivalent model formula is

$$\text{logit}[P(Y = 1)] = \alpha + \beta_i^X + \beta_k^Z. \tag{5.12}$$

For each factor, one parameter is redundant. Fixing one at 0, such as $\beta_I^X = \beta_K^Z = 0$,
represents the category not having its own indicator variable. Conditional independence
between $X$ and $Y$, given $Z$, corresponds to $\beta_1^X = \beta_2^X = \cdots = \beta_I^X$, whereby $P(Y = 1)$ does
not change as $i$ changes, for fixed $k$.


5.4.2  Example:  AIDS  and  AZT  Use 

Table  5.6  is  from  a  study  on  the  effects  of  AZT  in  slowing  the  development  of  AIDS 
symptoms.  In  the  study,  338  veterans  whose  immune  systems  were  beginning  to  falter 
after  infection  with  HIV  were  randomly  assigned  either  to  receive  AZT  immediately  or 
to  wait  until  their  T  cells  showed  severe  immune  weakness.  Table  5.6  cross-classifies  the 
veterans’  race,  whether  they  received  AZT  immediately,  and  whether  they  developed  AIDS 
symptoms  during  the  3-year  study. 

In model (5.10), we identify $X$ with AZT treatment ($x = 1$ for immediate AZT use,
$x = 0$ otherwise) and $Z$ with race ($z = 1$ for whites, $z = 0$ for blacks), for predicting the
probability that AIDS symptoms developed. Thus, $\alpha$ is the log odds of developing AIDS
symptoms for black subjects without immediate AZT use, $\beta_1$ is the increment to the log
odds for those with immediate AZT use, and $\beta_2$ is the increment to the log odds for
white subjects. Table 5.7 shows output. The estimated odds ratio between immediate AZT
use and development of AIDS symptoms equals $\exp(-0.7195) = 0.487$. For each race,
the estimated odds of symptoms are about half as high for those who took AZT immediately.
The Wald confidence interval for this effect is $\exp[-0.720 \pm 1.96(0.279)] = (0.28, 0.84)$.
Similar results occur for the likelihood-based interval, as shown.

The hypothesis of conditional independence of AZT treatment and development of
AIDS symptoms, controlling for race, is $H_0$: $\beta_1 = 0$ in (5.10). The likelihood-ratio statistic
comparing the model with the simpler model having $\beta_1 = 0$ equals 6.87 (df = 1), showing
evidence of association ($P = 0.01$). The Wald statistic $(\hat\beta_1/SE)^2 = (-0.7195/0.279)^2 =
6.65$, shown in the output, provides similar results.
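
For readers without SAS, the main-effects model (5.10) can be fitted to the four binomials of Table 5.6 with a short Newton-Raphson routine. The Python sketch below is a minimal illustration, not the routine used to produce Table 5.7; the helper names `solve` and `fit`, the zero starting values, and the fixed iteration count are assumptions made here.

```python
import math

# (race, azt, symptoms yes, group size); race = 1 for white, azt = 1 for immediate use
data = [(1, 1, 14, 107), (1, 0, 32, 113), (0, 1, 11, 63), (0, 0, 12, 55)]

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def fit(data, iters=25):
    """Newton-Raphson for logit[P(Y=1)] = alpha + beta1*azt + beta2*race."""
    beta = [0.0, 0.0, 0.0]
    info = None
    for _ in range(iters):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for race, azt, y, n in data:
            x = (1.0, azt, race)
            eta = sum(bj * xj for bj, xj in zip(beta, x))
            pi = 1 / (1 + math.exp(-eta))
            for j in range(3):
                score[j] += x[j] * (y - n * pi)
                for k in range(3):
                    info[j][k] += n * pi * (1 - pi) * x[j] * x[k]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    # info was evaluated just before the final (negligible) step.
    return beta, info

beta, info = fit(data)
se_azt = math.sqrt(solve(info, [0.0, 1.0, 0.0])[1])
odds_ratio = math.exp(beta[1])
wald_ci = (math.exp(beta[1] - 1.96 * se_azt), math.exp(beta[1] + 1.96 * se_azt))
print([round(b, 4) for b in beta], round(se_azt, 4),
      round(odds_ratio, 3), [round(v, 2) for v in wald_ci])
```

The estimates can be compared with the intercept, `azt`, and `race` rows of Table 5.7.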


Table 5.6  Development of AIDS
Symptoms by AZT Use and Race

                        Symptoms
Race      AZT Use      Yes     No
White     Yes           14     93
          No            32     81
Black     Yes           11     52
          No            12     43

Source: The New York Times, Feb. 15, 1991.


Table 5.7  Software Output (Based on SAS) for Logistic Model with AIDS Symptoms Data

Goodness-of-Fit Statistics
Criterion     DF     Value     Pr > ChiSq
Deviance       1     1.3835    0.2395
Pearson        1     1.3910    0.2382

Analysis of Maximum Likelihood Estimates
Parameter    Estimate    Std Error    Wald Chi-Sq    Pr > ChiSq
Intercept    -1.0736     0.2629       16.6705        <.0001
azt          -0.7195     0.2790        6.6507        0.0099
race          0.0555     0.2886        0.0370        0.8476

Obs    race    azt     y      n     pi_hat     lower      upper
1      1       1      14    107     0.14962    0.09897    0.21987
2      1       0      32    113     0.26540    0.19668    0.34774
3      0       1      11     63     0.14270    0.08704    0.22519
4      0       0      12     55     0.25472    0.16953    0.36396

Profile Like. CI for Odds Ratios
Effect    Estimate    95% Conf Limits
azt       0.487       0.279    0.835
race      1.057       0.605    1.884


Table 5.8 shows parameter estimates for three ways of defining factor parameters in
(5.12): (1) setting the last parameter equal to 0, (2) setting the first parameter equal to 0,
and (3) having parameters sum to zero. This corresponds to setting up indicator variables
for each category except the last in scheme (1), and for each category except the first in
scheme (2). In scheme (3), there is also a reference category, and for other categories the
indicator is 1 for an observation in the category, -1 for an observation in the reference
category, and 0 otherwise. For each coding scheme, at a given combination of AZT use and
race, the estimated probability of developing AIDS symptoms is the same. For instance, the
intercept estimate plus the estimate for immediate AZT use plus the estimate for being white
is -1.738 for each scheme, so the estimated probability that white veterans with immediate
AZT use develop AIDS symptoms equals exp(-1.738)/[1 + exp(-1.738)] = 0.15. The
bottom of Table 5.7 shows point and interval estimates of the probabilities. Figure 5.6
shows a graphical representation of the sample proportions (the four dots) and the point
estimates plus and minus a standard error.


Table 5.8 Parameter Estimates for Logistic Model Fitted to
Table 5.6 on AIDS and AZT Use

                         Definition of Parameters
Parameter        Last = Zero    First = Zero    Sum = Zero
Intercept          -1.074         -1.738          -1.406
AZT    Yes         -0.720          0.000          -0.360
       No           0.000          0.720           0.360
Race   White        0.055          0.000           0.028
       Black        0.000         -0.055          -0.028


Figure 5.6 Estimated effects of AZT use and race on probability of developing AIDS symptoms (dots are
sample proportions).

For each coding scheme, $\hat\beta_1^X - \hat\beta_2^X$ is identical and represents the conditional log odds
ratio of X with the response, given Z. Here, $\exp(\hat\beta_1^X - \hat\beta_2^X) = \exp(-0.720) = 0.49$ estimates
the common odds ratio between immediate AZT use and AIDS symptoms, for each race.
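The invariance across coding schemes can be checked numerically from Table 5.8. The sketch below is only an illustration with hypothetical names (`schemes`, `linear_predictor`): for each scheme it adds the intercept, the AZT term, and the race term for white veterans with immediate AZT use, and it also forms the AZT contrast (yes minus no).

```python
import math

# Parameter estimates from Table 5.8 for the three coding schemes.
schemes = {
    "last=zero":  {"intercept": -1.074,
                   "azt": {"yes": -0.720, "no": 0.000},
                   "race": {"white": 0.055, "black": 0.000}},
    "first=zero": {"intercept": -1.738,
                   "azt": {"yes": 0.000, "no": 0.720},
                   "race": {"white": 0.000, "black": -0.055}},
    "sum=zero":   {"intercept": -1.406,
                   "azt": {"yes": -0.360, "no": 0.360},
                   "race": {"white": 0.028, "black": -0.028}},
}


def linear_predictor(est, azt, race):
    """Estimated logit for one combination of AZT use and race."""
    return est["intercept"] + est["azt"][azt] + est["race"][race]


def probability(logit):
    """Inverse logit transform."""
    return math.exp(logit) / (1.0 + math.exp(logit))


results = {}
for name, est in schemes.items():
    logit = linear_predictor(est, "yes", "white")
    contrast = est["azt"]["yes"] - est["azt"]["no"]  # conditional log odds ratio
    results[name] = (logit, probability(logit), contrast)
    print(name, round(logit, 3), round(probability(logit), 2), round(contrast, 3))
```

The logits agree up to the three-decimal rounding used in Table 5.8, and the contrast equals -0.720 under every scheme.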

5.4.3  Goodness  of  Fit  as  a  Likelihood-Ratio  Test 

The likelihood-ratio statistic $G^2(M_0 \mid M_1) = -2(L_0 - L_1)$ tests whether certain model
parameters are zero, given that $M_1$ holds, by comparing the log likelihood $L_1$ for the fitted
model $M_1$ with $L_0$ for a simpler model $M_0$. The goodness-of-fit statistic $G^2(M)$ is a special
case in which $M_0 = M$ and $M_1$ is the saturated model. In testing whether M fits, we test
whether all parameters in the saturated model but not in M equal zero. The asymptotic df
is the difference in the number of parameters in the two models, which is the number of
binomials modeled minus the number of parameters in M.

We illustrate by checking the fit of model (5.10) for the AIDS data. For its fit, white
veterans with immediate AZT use had estimated probability 0.150 of developing AIDS
symptoms during the study. Since 107 white veterans took AZT, the fitted value is
107(0.150) = 16.0 for developing symptoms and 107(0.850) = 91.0 for not developing
them. Similarly, we can obtain fitted values for all eight cells in Table 5.6. The
goodness-of-fit statistics comparing these with the cell counts are $G^2 = 1.38$ and $X^2 = 1.39$. The
model has four binomials, one at each combination of AZT use and race. Since it has three
parameters, residual df = 4 - 3 = 1. The small $G^2$ and $X^2$ values suggest that the model
fits decently (P > 0.2).
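As a rough check of these goodness-of-fit values, the sketch below recomputes $G^2$ and $X^2$ from the counts in Table 5.6 and the fitted probabilities (pi.hat) reported in Table 5.7. The variable names are illustrative. Because the fitted probabilities are rounded to five decimals, the results agree with the software output only to about three decimals.

```python
import math

# Counts from Table 5.6 and fitted probabilities (pi.hat) from Table 5.7, ordered as
# (white, AZT), (white, no AZT), (black, AZT), (black, no AZT).
y = [14, 32, 11, 12]        # number developing symptoms
n = [107, 113, 63, 55]      # binomial sample sizes
pi_hat = [0.14962, 0.26540, 0.14270, 0.25472]

g2 = 0.0  # deviance (likelihood-ratio) statistic
x2 = 0.0  # Pearson statistic
for yi, ni, pi in zip(y, n, pi_hat):
    fitted_yes = ni * pi
    fitted_no = ni * (1.0 - pi)
    # Both statistics sum over the two cells (symptoms yes / no) of each binomial.
    g2 += 2.0 * (yi * math.log(yi / fitted_yes)
                 + (ni - yi) * math.log((ni - yi) / fitted_no))
    x2 += ((yi - fitted_yes) ** 2 / fitted_yes
           + ((ni - yi) - fitted_no) ** 2 / fitted_no)

residual_df = len(y) - 3                     # four binomials minus three parameters
p_g2 = math.erfc(math.sqrt(g2 / 2.0))        # chi-squared upper tail, valid for df = 1

print(round(g2, 2), round(x2, 2), residual_df, round(p_g2, 2))
```

The logarithms assume that no observed count is zero, which holds for Table 5.6.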

For  model  (5.10),  the  odds  ratio  between  X  and  Y  is  the  same  at  each  level  of  Z. 
The  goodness-of-fit  test  checks  this  structure.  That  is,  the  test  also  provides  a  test  of 
homogeneous  odds  ratios.  For  Table  5.6,  homogeneity  is  plausible.  Since  residual  df  =  1, 
the  more  complex  model  that  adds  an  interaction  term  and  permits  the  two  odds  ratios  to 
differ is saturated.

5.4.4 Model Comparison by Comparing Deviances

Let $L_S$ denote the maximized log likelihood for the saturated model. As discussed in
Section 4.5.4, the likelihood-ratio statistic for comparing models $M_1$ and $M_0$ is

$$G^2(M_0 \mid M_1) = -2(L_0 - L_1) = -2(L_0 - L_S) - [-2(L_1 - L_S)] = G^2(M_0) - G^2(M_1).$$

The test statistic comparing two models is identical to the difference in $G^2$ goodness-of-fit
statistics (deviances) for the two models. To illustrate, consider $H_0$: $\beta_2 = 0$ for the race
effect with the AIDS data. The likelihood-ratio statistic equals 0.04, suggesting that the
simpler model is adequate. But this equals $G^2(M_0) - G^2(M_1) = 1.42 - 1.38$, where $M_0$ is
the simpler model with $\beta_2 = 0$.
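The deviance difference can be reproduced without refitting either model. The sketch below assumes two facts: the fitted probabilities for $M_1$ are those printed in Table 5.7, and for $M_0$ (intercept plus the binary AZT indicator only) the ML fitted probabilities are the pooled sample proportions within each AZT group, since that model is saturated for the table collapsed over race. The helper name `deviance` is hypothetical.

```python
import math

# Table 5.6, ordered (white, AZT), (white, no AZT), (black, AZT), (black, no AZT).
y = [14, 32, 11, 12]
n = [107, 113, 63, 55]
azt = [1, 0, 1, 0]


def deviance(y, n, fitted_prob):
    """G^2 for grouped binomial data; assumes no zero counts."""
    total = 0.0
    for yi, ni, pi in zip(y, n, fitted_prob):
        total += yi * math.log(yi / (ni * pi))
        total += (ni - yi) * math.log((ni - yi) / (ni * (1.0 - pi)))
    return 2.0 * total


# M1: fitted probabilities reported in Table 5.7.
pi_m1 = [0.14962, 0.26540, 0.14270, 0.25472]

# M0 (beta_2 = 0): pooled proportion of symptoms within each AZT group.
pooled = {}
for level in (0, 1):
    successes = sum(yi for yi, a in zip(y, azt) if a == level)
    trials = sum(ni for ni, a in zip(n, azt) if a == level)
    pooled[level] = successes / trials
pi_m0 = [pooled[a] for a in azt]

g2_m0 = deviance(y, n, pi_m0)
g2_m1 = deviance(y, n, pi_m1)
lr_stat = g2_m0 - g2_m1   # likelihood-ratio statistic for H0: beta_2 = 0

print(round(g2_m0, 2), round(g2_m1, 2), round(lr_stat, 2))
```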

The model comparison statistic often has an approximate chi-squared null distribution
even when separate $G^2(M_i)$ do not. For instance, when at least one predictor is continuous
or a contingency table has very small fitted values, the sampling distribution of $G^2(M_i)$
may be far from chi-squared. Nonetheless, if df for the comparison statistic is modest (as
in comparing two models that differ by at most a few parameters), the null distribution of
$G^2(M_0 \mid M_1)$ is approximately chi-squared.

5.4.5  Example:  Horseshoe  Crab  Satellites  Revisited 

For the horseshoe crab data, we next use both the female crab's carapace width and color as
predictors of Y = whether the crab has at least one satellite (1 = yes, 0 = no). Color has five
categories: light, medium light, medium, medium dark, dark. It is a surrogate for age, older
crabs tending to be darker. The sample contained no light crabs, so our models use only
the other four categories. We first treat color as qualitative. The four categories use three
indicator variables. The model for the probability that the crab has at least one satellite is

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x, \tag{5.13}$$

where x = width in centimeters, and

$c_1 = 1$ for medium-light color, and 0 otherwise,
$c_2 = 1$ for medium color, and 0 otherwise,
$c_3 = 1$ for medium-dark color, and 0 otherwise.

The crab color is dark (category 4) when $c_1 = c_2 = c_3 = 0$. Table 5.9 shows the ML
parameter estimates. For instance, for dark crabs, $\text{logit}[\hat P(Y = 1)] = -12.715 + 0.468x$;
by contrast, for medium-light crabs, $c_1 = 1$, and $\text{logit}[\hat P(Y = 1)] = (-12.715 + 1.330) +
0.468x = -11.385 + 0.468x$. At the average width of 26.3 cm, $\hat P(Y = 1) = 0.399$ for dark
crabs and 0.715 for medium-light crabs. The exponentiated difference between two color
parameter estimates is an odds ratio comparing those colors. For instance, the difference
for medium-light crabs and dark crabs equals 1.330. At any given width, the estimated odds
that a medium-light crab has a satellite are exp(1.330) = 3.8 times the estimated odds for a
dark crab. At width x = 26.3, the odds equal 0.715/0.285 = 2.51 for a medium-light crab
and 0.399/0.601 = 0.66 for a dark crab, for which 2.51/0.66 = 3.8.


Table 5.9 Software Output (Based on SAS) for Model with Width and Color Predictors of
Whether Horseshoe Crab Has Satellites

     Criteria For Assessing Goodness Of Fit
Criterion              DF        Value
Deviance              168     187.4570
Pearson Chi-Square    168     168.6590
Log Likelihood                -93.7285

                        Standard     Likelihood-Ratio 95%      Chi-
Parameter    Estimate     Error       Confidence Limits       Square    Pr>ChiSq
intercept    -12.7151     2.7618     -18.4564     -7.5788      21.20     <.0001
c1             1.3299     0.8525      -0.2738      3.1354       2.43     0.1188
c2             1.4023     0.5484       0.3527      2.5260       6.54     0.0106
c3             1.1061     0.5921      -0.0279      2.3138       3.49     0.0617
width          0.4680     0.1055       0.2713      0.6870      19.66     <.0001

To test whether color contributes significantly to model (5.13), we test $H_0$: $\beta_1 = \beta_2 =
\beta_3 = 0$. This states that controlling for width, the probability of a satellite is independent
of color. We compare the maximized log-likelihood $L_1$ for the full model (5.13) to $L_0$ for
the simpler model. The test statistic $-2(L_0 - L_1) = 7.0$ has df = 3, the difference between
the numbers of parameters in the two models. The chi-squared P-value of 0.07 provides
slight evidence of a color effect.
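The fitted probabilities and the odds ratio quoted for dark and medium-light crabs under model (5.13) can be recovered from the Table 5.9 estimates. The sketch below is illustrative only; it uses the rounded mean width of 26.3 cm, so the probabilities match the quoted 0.399 and 0.715 only to within rounding.

```python
import math

# ML estimates from Table 5.9 for model (5.13).
intercept = -12.7151
c1_coef = 1.3299        # medium-light versus dark
width_coef = 0.4680
mean_width = 26.3       # average width in cm (rounded, as quoted in the text)


def prob(logit):
    """Inverse logit transform."""
    return 1.0 / (1.0 + math.exp(-logit))


# Dark crabs have c1 = c2 = c3 = 0; medium-light crabs have c1 = 1.
p_dark = prob(intercept + width_coef * mean_width)
p_medium_light = prob(intercept + c1_coef + width_coef * mean_width)

odds_dark = p_dark / (1.0 - p_dark)
odds_medium_light = p_medium_light / (1.0 - p_medium_light)
odds_ratio = odds_medium_light / odds_dark   # equals exp(c1_coef) at any width

print(round(p_dark, 3), round(p_medium_light, 3), round(odds_ratio, 1))
```

Because the model has no color-by-width interaction, the ratio of odds is the same whatever width is substituted for `mean_width`.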

The model assumes a lack of interaction between color and width in their effects. Width
has the coefficient of 0.468 for all colors, so the shapes of the curves relating width to
P(Y = 1) are identical. Figure 5.7 displays the fitted model. Any one curve equals any
other curve shifted to the right or left. The parallelism of curves in the horizontal dimension
implies that any two curves never cross. At all width values, color 4 (dark) has a lower
estimated probability of a satellite than the other colors. There is a noticeable positive effect
of width.


Figure 5.7 Logistic regression model using additive width and color predictors of whether horseshoe crab has
satellites.

The more complex model allowing color × width interaction has three additional terms,
the cross-products of width with the color indicator variables. Fitting this model is equivalent
to fitting logistic regression with width predictor separately for crabs of each color. Each
color then has a different-shaped curve relating width to P(Y = 1), so a comparison of
two colors varies according to the width value. The likelihood-ratio statistic comparing the
models with and without the interaction terms equals 4.4, with df = 3. The evidence of
interaction is weak (P = 0.22).
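The P-values attached to these likelihood-ratio statistics are upper-tail chi-squared probabilities. For small df they have closed forms, so a sketch needs only the standard library. The function name `chi2_sf` is hypothetical, and the sketch deliberately covers only df = 1, 2, 3.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate (closed forms, df = 1, 2, 3)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch only covers df = 1, 2, 3")


p_color = chi2_sf(7.0, 3)         # test of the color effect in model (5.13)
p_interaction = chi2_sf(4.4, 3)   # test of color x width interaction

print(round(p_color, 2), round(p_interaction, 2))
```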


5.4.6  Quantitative  Treatment  of  Ordinal  Predictor 

Color has ordered categories, from lightest to darkest. A simpler model yet treats this
predictor as quantitative. Color may have a linear effect, for a set of monotone scores. To
illustrate, for scores c = (1, 2, 3, 4) for the color categories, the model

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 c + \beta_2 x \tag{5.14}$$

has $\hat\alpha = -10.071$, $\hat\beta_1 = -0.509$ (SE = 0.224) and $\hat\beta_2 = 0.458$ (SE = 0.104). This shows
strong evidence of an effect for each. At a given width, for every one-category increase in
color darkness, the estimated odds of a satellite multiply by exp(-0.509) = 0.60.

The likelihood-ratio statistic comparing this fit to the more complex model (5.13) having
a separate parameter for each color equals 1.66 (df = 2). This statistic tests that the simpler
model (5.14) is adequate, given that model (5.13) holds. It tests that when plotted against
the color scores, the color parameters in (5.13) follow a linear trend. The simplification
seems permissible (P = 0.44).

The color parameter estimates in the qualitative-color model (5.13) are (1.33, 1.40,
1.11, 0), the 0 value for the dark category reflecting its lack of an indicator variable.
Although these values do not depart significantly from a linear trend, the first three
are quite similar compared with the last one. Thus, another potential color scoring for
model (5.14) is (1, 1, 1, 0); that is, score = 0 for dark-colored crabs, and score = 1
otherwise. The likelihood-ratio statistic comparing model (5.14) with these binary scores to
model (5.13) equals 0.50 (df = 2), showing that this simpler model is also adequate. Its
fit is

$$\text{logit}[\hat P(Y = 1)] = -12.980 + 1.300c + 0.478x, \tag{5.15}$$

with standard errors 0.526 and 0.104. At a given width, the estimated odds that a
lighter-colored crab has a satellite are exp(1.300) = 3.7 times the estimated odds for a dark crab.

In summary, the qualitative-color model, the quantitative-color model with scores
(1, 2, 3, 4), and the model with binary color scores (1, 1, 1, 0) all suggest that dark crabs
are least likely to have satellites. A much larger sample is needed to determine which
color scoring is most appropriate. With moderate-sized samples, it's not unusual for quite
different models to be consistent with the data.

5.4.7 Probability-Based and Standardized Interpretations

Although it is natural to interpret logistic regression model parameters as effects on a
log odds, some find it difficult to understand odds or odds ratio effects. The simpler
interpretation using the instantaneous rate of change in the probability (Section 5.1.1)
applies also to multiple predictors. Consider a setting of predictors at which $\hat P(Y = 1) = \hat\pi$.
Then, adjusting for the other predictors, as a function of a quantitative predictor $x_j$, $\hat\pi$
has instantaneous rate of change of $\hat\beta_j\hat\pi(1 - \hat\pi)$. For instance, at predictor settings at
which $\hat\pi = 0.50$ for fit (5.15), the approximate effect of a 1-cm increase in width is
(0.478)(0.50)(0.50) = 0.12. This is considerable, since a 1-cm change in width is less than
half a standard deviation.
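To see how good the instantaneous-rate approximation is, the sketch below compares it with the exact change in the fitted probability when width increases by 1 cm, starting from a setting where the fitted probability is 0.50. Only the width coefficient of fit (5.15) is taken from the text; the variable names are illustrative.

```python
import math

beta_width = 0.478   # width coefficient from fit (5.15)
pi_hat = 0.50        # fitted probability at the chosen predictor setting

# Instantaneous rate of change: beta_j * pi * (1 - pi).
rate = beta_width * pi_hat * (1.0 - pi_hat)

# Exact change in probability for a 1-cm increase in width.
logit_start = math.log(pi_hat / (1.0 - pi_hat))
pi_after = 1.0 / (1.0 + math.exp(-(logit_start + beta_width)))
exact_change = pi_after - pi_hat

print(round(rate, 3), round(exact_change, 3))
```

The tangent-line value (about 0.12) slightly overstates the exact change because the logistic curve flattens as it moves away from probability 0.50.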

We could summarize the effect of $x_j$ on the probability scale by averaging the
instantaneous rates for the sample. Let $x_{ij}$ denote the value of $x_j$ for subject i and let
$\hat\pi(x_{i1}, \ldots, x_{ip})$ denote the estimate of P(Y = 1) at the explanatory variable values
for subject i. This summary is

$$\frac{1}{n}\sum_{i=1}^{n} \hat\beta_j\,\hat\pi(x_{i1}, \ldots, x_{ip})\,[1 - \hat\pi(x_{i1}, \ldots, x_{ip})].$$

Alternatively, to describe the effect of $x_j$ in a simpler manner not depending on its
units, we could set the other predictors at their sample means and compute the estimated
probabilities at the smallest and largest $x_j$ values. These are sensitive to outliers, however,
so we could instead use the upper and lower quartiles of $x_j$. For the fit (5.15) with binary
color, the sample means are 26.3 for x and 0.873 for c. The lower and upper quartiles of
x are 24.9 and 27.7. At x = 24.9 and $c = \bar{c}$, $\hat\pi = 0.51$. At x = 27.7 and $c = \bar{c}$, $\hat\pi = 0.80$.
The change in $\hat\pi$ from 0.51 to 0.80 over the middle 50% of the range of width values reflects
a strong width effect. Since c takes only values 0 and 1, we could instead report this effect
separately for each. Also, when an explanatory variable is an indicator, it makes sense to
report the estimated probabilities at its two values rather than at quartiles, which could be
identical. At x = 26.3, $\hat\pi = 0.40$ when c = 0 and $\hat\pi = 0.71$ when c = 1. This color effect,
differentiating dark crabs from others, is also substantial.
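These probability summaries come from plugging values into fit (5.15). The sketch below evaluates the fit at the quartiles of width (with color at its mean) and at the two color values (with width at its mean). All inputs are numbers quoted in the text; the function name `fitted_prob` is hypothetical.

```python
import math

# Fit (5.15): logit = -12.980 + 1.300 c + 0.478 x, with c = 0 for dark crabs, 1 otherwise.
intercept, beta_color, beta_width = -12.980, 1.300, 0.478
mean_width, mean_color = 26.3, 0.873
lower_quartile, upper_quartile = 24.9, 27.7


def fitted_prob(color, width):
    """Estimated P(Y = 1) at a given color score and width (cm)."""
    logit = intercept + beta_color * color + beta_width * width
    return 1.0 / (1.0 + math.exp(-logit))


# Width effect: quartiles of width, with color held at its sample mean.
p_lq = fitted_prob(mean_color, lower_quartile)
p_uq = fitted_prob(mean_color, upper_quartile)

# Color effect: dark versus other, with width held at its sample mean.
p_dark = fitted_prob(0, mean_width)
p_other = fitted_prob(1, mean_width)

print(round(p_lq, 2), round(p_uq, 2), round(p_uq - p_lq, 2))
print(round(p_dark, 2), round(p_other, 2), round(p_other - p_dark, 2))
```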

Table 5.10 summarizes the logistic parameter estimates and some probability comparison
effects. It also shows results of the extension of model (5.15), permitting interaction. The
estimated width effect is then greater for the lighter-colored crabs. However, the interaction
is not significant.


Table 5.10 Summary of Effects in Model (5.15) with Crab Width and Color
(Treated as Binary) as Predictors of Presence of Satellites

Variable                        Estimate     SE       Comparison               Change in Probability
No interaction model
  Intercept                     -12.980     2.727
  Color (0 = dark, 1 = other)     1.300     0.526     (1, 0) at $\bar{x}$      0.31 = 0.71 - 0.40
  Width, x (cm)                   0.478     0.104     (UQ, LQ) at $\bar{c}$    0.29 = 0.80 - 0.51
Interaction model
  Intercept                      -5.854     6.694
  Color (0 = dark, 1 = other)    -6.958     7.318
  Width, x (cm)                   0.200     0.262     (UQ, LQ) at c = 0        0.13 = 0.43 - 0.30
  Width × color                   0.322     0.286     (UQ, LQ) at c = 1        0.29 = 0.84 - 0.55

To compare effects of quantitative predictors having different units, it can also be helpful
to report standardized coefficients. One approach fits the model to standardized predictors,
replacing each $x_j$ by $(x_j - \bar{x}_j)/s_{x_j}$. Then, each regression coefficient represents the effect
of a standard deviation change in a predictor, adjusting for the other variables.
Equivalently, for each j the standardized coefficient results from multiplying the unstandardized
estimate $\hat\beta_j$ by $s_{x_j}$. For example, for fit (5.15) with binary color, the standard deviation
of width is 2.109 cm. The standardized estimate for the effect of width for that model is
0.478(2.109) = 1.01. When we replace width by weight (with standard deviation 0.577 kg)
in the model, the unstandardized estimate 1.729 corresponds to the standardized estimate
1.729(0.577) = 1.00. The unstandardized estimates 0.478 and 1.729 are quite different, but
width and weight (standardized) have similar effects, conditional on whether or not a crab
is dark.

Since the standard logistic cdf has standard deviation $\pi/\sqrt{3}$, some software (e.g., PROC
LOGISTIC in SAS) defines a standardized estimate by multiplying the unstandardized
estimate by $s_{x_j}\sqrt{3}/\pi$. Such a standardized estimate represents the effect on the location of
an underlying latent response variable (in standard deviation units) for a standard deviation
change in a predictor, adjusting for the other variables. For example, for fit (5.15) with
binary color, this standardized estimate for the effect of width is $0.478(2.109)\sqrt{3}/\pi =
0.556$. A standard deviation change in width, conditional on a color, corresponds to a
0.556 standard deviation shift upwards in the distribution of the latent logistic response
variable.
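Both kinds of standardized estimate are simple rescalings of the unstandardized coefficient, as the following sketch shows. It uses only the estimates and standard deviations quoted above; the variable names are illustrative.

```python
import math

# Unstandardized estimates and predictor standard deviations quoted in the text.
beta_width, sd_width = 0.478, 2.109      # width in cm, fit (5.15)
beta_weight, sd_weight = 1.729, 0.577    # weight in kg, model with weight replacing width

# Standardized coefficient: effect of a one standard deviation change in the predictor.
std_width = beta_width * sd_width
std_weight = beta_weight * sd_weight

# Latent-variable version: additionally divide by the standard deviation of the
# standard logistic distribution, pi / sqrt(3).
latent_std_width = std_width * math.sqrt(3.0) / math.pi

print(round(std_width, 2), round(std_weight, 2), round(latent_std_width, 3))
```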


5.4.8  Estimating  an  Average  Causal  Effect 

In many applications the explanatory variable of primary interest specifies two groups to
be compared while adjusting for the other explanatory variables in the model. Let j = 1
identify this binary group variable, with the groups denoted by $x_1 = 0$ and $x_1 = 1$. For the
logistic regression model, an alternative to the log odds ratio as an effect summary is the
estimated average causal effect,

$$\frac{1}{n}\sum_{i=1}^{n}\left[\hat\pi(x_{i1} = 1, x_{i2}, \ldots, x_{ip}) - \hat\pi(x_{i1} = 0, x_{i2}, \ldots, x_{ip})\right].$$

For each observation i, we find the fitted probability for the given values of $x_{i2}, \ldots, x_{ip}$ (1)
if that observation were in group 1 and (2) if that observation were in group 0, and average
the differences among all n observations. This estimates the difference between the overall
proportions of "successes" if all subjects in the study were in group 1 compared with all
being in group 0. It is usually not adequate to use a linear probability model (i.e., identity
link function) for the full data set, by which such a difference would be constant across
subjects, but nonetheless this is a useful summary for cases in which this difference is
relatively stable.

We illustrate using Table 5.6 from the randomized study of AZT use and AIDS. In
Section 5.4.2 we summarized the effect of AZT use by the estimated conditional odds
ratio of $\exp(\hat\beta_1) = 0.487$. Alternatively, from the probability estimates shown in Table 5.7,
the difference between those not receiving AZT and those receiving AZT in the estimated
proportion developing AIDS symptoms was 0.2654 - 0.1496 = 0.1158 for whites and
0.2547 - 0.1427 = 0.1120 for blacks. Weighted by the sample sizes of whites and blacks,
the estimated average causal effect is (220/338)(0.1158) + (118/338)(0.1120) = 0.1145.
In fact, this is similar in magnitude to the ML estimate of -0.1152 for the corresponding
linear probability model.
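The weighted average above is the per-subject average written compactly: every subject of a given race has the same difference in fitted probabilities, so averaging over all 338 subjects weights each race by its sample size. The sketch below spells this out with the Table 5.7 estimates. Following the text, the difference is taken as "no AZT" minus "AZT"; the dictionary layout is an illustrative choice.

```python
# Fitted probabilities from Table 5.7, keyed by (race, azt) with azt = 1 for immediate use.
pi_hat = {
    ("white", 1): 0.14962, ("white", 0): 0.26540,
    ("black", 1): 0.14270, ("black", 0): 0.25472,
}

# Number of subjects of each race, both AZT groups combined (Table 5.6).
n_race = {"white": 107 + 113, "black": 63 + 55}
n_total = sum(n_race.values())

# Difference in fitted probability for a subject of each race if moved between groups.
diff = {race: pi_hat[(race, 0)] - pi_hat[(race, 1)] for race in n_race}

# Average the per-subject differences over all n subjects.
average_effect = sum(n_race[race] * diff[race] for race in n_race) / n_total

print(n_total, round(diff["white"], 4), round(diff["black"], 4), round(average_effect, 4))
```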

For  categorical  predictors,  Copas  and  Eguchi  (2010)  showed  how  to  obtain  a  standard 
error  for  an  estimated  average  causal  effect  that  applies  for  the  logistic  model.  They  also 
presented  a  nonparametric  standard  error  for  an  estimate  that,  instead  of  being  model-based, 
is  a  weighted  average  of  the  differences  of  the  sample  proportions  at  the  various  levels  of  the 
explanatory  variables.  The  main  theme  of  their  article,  however,  was  adjusting  inferences 
for  the  fact  that  many  models  may  be  consistent  with  the  data.  The  average  causal  effect  is 
often  a  relevant  measure  regardless  of  the  form  of  the  true  relationship. 

Estimating  an  average  causal  effect  is  natural  for  experimental  studies.  It  has  also 
received  much  attention  for  nonrandomized  studies  since  the  fundamental  article  by  Rubin 
(1974)  and  later  work  using  methods  to  adjust  for  different  propensities  of  a  subject  to  be 
in one group or the other (e.g., see Section 6.4.11).


5.5  FITTING  LOGISTIC  REGRESSION  MODELS 


The mechanics of ML estimation and model fitting for logistic regression are special
cases of the GLM fitting results of Section 4.6. With n subjects, we treat the n binary
responses as independent. Let $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ denote setting i of the values of p
explanatory variables and a coefficient $x_{i0} = 1$ for an intercept term, $i = 1, \ldots, N$. When
explanatory variables are continuous, a different setting may occur for each subject, in
which case N = n. This also happens when the data file consists of ungrouped data. The
logistic regression model (5.8), treating the intercept $\alpha$ as a regression parameter $\beta_0$ for an
explanatory variable that always equals 1, is

$$\pi(\mathbf{x}_i) = \frac{\exp\left(\sum_{j=0}^{p} \beta_j x_{ij}\right)}{1 + \exp\left(\sum_{j=0}^{p} \beta_j x_{ij}\right)}. \tag{5.16}$$


5.5.1  Likelihood  Equations  for  Logistic  Regression 

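When more than one observation occurs at a fixed $\mathbf{x}_i$ value, it is sufficient to record the
number of observations $n_i$ and the number of successes. We then let $y_i$ refer to this success
count rather than to an individual binary response. Then $\{Y_1, \ldots, Y_N\}$ are independent
binomials with $E(Y_i) = n_i\pi(\mathbf{x}_i)$, where $n_1 + \cdots + n_N = n$. Their joint probability mass
function is proportional to the product of N binomial functions,

$$\prod_{i=1}^{N} \pi(\mathbf{x}_i)^{y_i}[1 - \pi(\mathbf{x}_i)]^{n_i - y_i}
= \left\{\prod_{i=1}^{N} \exp\left[\log\left(\frac{\pi(\mathbf{x}_i)}{1 - \pi(\mathbf{x}_i)}\right)^{y_i}\right]\right\}\left\{\prod_{i=1}^{N}[1 - \pi(\mathbf{x}_i)]^{n_i}\right\}
= \exp\left[\sum_{i} y_i \log\frac{\pi(\mathbf{x}_i)}{1 - \pi(\mathbf{x}_i)}\right]\left\{\prod_{i=1}^{N}[1 - \pi(\mathbf{x}_i)]^{n_i}\right\}.$$

For model (5.16), the ith logit is $\sum_j \beta_j x_{ij}$, so the exponential term here
equals $\exp[\sum_i y_i(\sum_j \beta_j x_{ij})] = \exp[\sum_j(\sum_i y_i x_{ij})\beta_j]$. Also, since
$[1 - \pi(\mathbf{x}_i)] = [1 + \exp(\sum_j \beta_j x_{ij})]^{-1}$, the log likelihood equals

$$L(\boldsymbol\beta) = \sum_{j}\left(\sum_{i} y_i x_{ij}\right)\beta_j - \sum_{i} n_i \log\left[1 + \exp\left(\sum_{j} \beta_j x_{ij}\right)\right]. \tag{5.17}$$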


This depends on the binomial counts only through the sufficient statistics for the model
parameters, $\{\sum_i y_i x_{ij},\ j = 0, 1, \ldots, p\}$.

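The likelihood equations result from setting $\partial L(\boldsymbol\beta)/\partial\boldsymbol\beta = \mathbf{0}$. Since

$$\frac{\partial L(\boldsymbol\beta)}{\partial \beta_j} = \sum_{i} y_i x_{ij} - \sum_{i} n_i x_{ij}\,\frac{\exp\left(\sum_k \beta_k x_{ik}\right)}{1 + \exp\left(\sum_k \beta_k x_{ik}\right)},$$

the likelihood equations are

$$\sum_{i} y_i x_{ij} - \sum_{i} n_i \hat\pi_i x_{ij} = 0, \quad j = 0, 1, \ldots, p, \tag{5.18}$$

where $\hat\pi_i = \exp(\sum_k \hat\beta_k x_{ik})/[1 + \exp(\sum_k \hat\beta_k x_{ik})]$ is the ML estimate of $\pi(\mathbf{x}_i)$. We
observed these equations as a special case of those for binomial GLMs in (4.28) (but there $y_i$
is the proportion of successes). The equations are nonlinear and require iterative solution.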

Let X denote the matrix of values of $\{x_{ij}\}$, with N rows for the binomial observations
and a column for each parameter. The likelihood equations (5.18) have form

$$X^T \mathbf{y} = X^T \hat{\boldsymbol\mu}, \tag{5.19}$$

where $\hat\mu_i = n_i\hat\pi_i$. This equation illustrates the fundamental result for GLMs with canonical
link, shown in equation (4.51), that the likelihood equations equate the sufficient statistics
to their expected values.
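For the AIDS data of Table 5.6, equation (5.19) can be checked directly: the sufficient statistics $X^T \mathbf{y}$ should match $X^T \hat{\boldsymbol\mu}$ computed from the fitted probabilities in Table 5.7. The sketch below assumes the column order (intercept, azt, race) with the 0/1 coding of model (5.10). Agreement is only to about two decimals because the printed probabilities are rounded.

```python
# Rows: (white, AZT), (white, no AZT), (black, AZT), (black, no AZT).
# Columns of X: intercept, azt indicator, race indicator (1 = white).
X = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 0],
     [1, 0, 0]]
y = [14, 32, 11, 12]                           # successes (symptoms developed)
n = [107, 113, 63, 55]                         # binomial sample sizes
pi_hat = [0.14962, 0.26540, 0.14270, 0.25472]  # fitted probabilities, Table 5.7

mu_hat = [ni * pi for ni, pi in zip(n, pi_hat)]  # fitted counts n_i * pi_hat_i

# Sufficient statistics X^T y and their fitted expected values X^T mu_hat.
sufficient = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(3)]
expected = [sum(row[j] * mi for row, mi in zip(X, mu_hat)) for j in range(3)]

for s, e in zip(sufficient, expected):
    print(s, round(e, 3))
```

The three sufficient statistics are the total number of subjects with symptoms, the number with symptoms among immediate AZT users, and the number with symptoms among whites.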


5.5.2  Asymptotic  Covariance  Matrix  of  Parameter  Estimators 

The ML estimators $\hat{\boldsymbol\beta}$ have a large-sample normal distribution with covariance matrix equal
to the inverse of the information matrix. The observed information matrix has elements

$$-\frac{\partial^2 L(\boldsymbol\beta)}{\partial\beta_a\,\partial\beta_b} = \sum_{i} \frac{x_{ia}x_{ib}n_i\exp\left(\sum_j \beta_j x_{ij}\right)}{\left[1 + \exp\left(\sum_j \beta_j x_{ij}\right)\right]^2} = \sum_{i} x_{ia}x_{ib}n_i\pi_i(1 - \pi_i). \tag{5.20}$$

This is not a function of $\{y_i\}$, so the observed and expected information are identical. This
happens for all GLMs that use canonical links (Section 4.6.5).

The estimated covariance matrix is the inverse of the matrix having elements (5.20),
substituting $\hat{\boldsymbol\beta}$. This has form

$$\widehat{\text{cov}}(\hat{\boldsymbol\beta}) = \{X^T \text{Diag}[n_i\hat\pi_i(1 - \hat\pi_i)]X\}^{-1}, \tag{5.21}$$

where $\text{Diag}[n_i\hat\pi_i(1 - \hat\pi_i)]$ denotes the N × N diagonal matrix having $\{n_i\hat\pi_i(1 - \hat\pi_i)\}$ on
the main diagonal. This is the special case of the GLM covariance matrix (4.31) with
estimated diagonal weight matrix $\hat{W}$ having elements $\hat{w}_i = n_i\hat\pi_i(1 - \hat\pi_i)$. The square roots
of the main diagonal elements of (5.21) are estimated standard errors of $\hat{\boldsymbol\beta}$.

5.5.3  Distribution  of  Probability  Estimators 

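Using $\widehat{\text{cov}}(\hat{\boldsymbol\beta})$, we can conduct Wald inference about $\boldsymbol\beta$ and related effects such as odds ratios.
We can also construct confidence intervals for response probabilities $\pi(\mathbf{x})$ at particular
settings $\mathbf{x}^T = (x_0, x_1, \ldots, x_p)$.

The estimated variance of $\text{logit}[\hat\pi(\mathbf{x})] = \mathbf{x}^T\hat{\boldsymbol\beta}$ is $\mathbf{x}^T\widehat{\text{cov}}(\hat{\boldsymbol\beta})\mathbf{x}$. For large samples,
$\text{logit}[\hat\pi(\mathbf{x})] \pm z_{\alpha/2}\sqrt{\mathbf{x}^T\widehat{\text{cov}}(\hat{\boldsymbol\beta})\mathbf{x}}$ is a confidence interval for the true logit. The endpoints
invert to a corresponding interval for $\pi(\mathbf{x})$ using the transform $\pi = \exp(\text{logit})/[1 +
\exp(\text{logit})]$.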


5.5.4  Newton-Raphson  Method  Applied  to  Logistic  Regression 

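Section 4.6.1 introduced the Newton-Raphson iterative method, which applies in a
straightforward manner to logistic regression. Let

$$u_j^{(t)} = \left.\frac{\partial L(\boldsymbol\beta)}{\partial\beta_j}\right|_{\boldsymbol\beta^{(t)}} = \sum_{i}\left(y_i - n_i\pi_i^{(t)}\right)x_{ij},$$

$$h_{ab}^{(t)} = \left.\frac{\partial^2 L(\boldsymbol\beta)}{\partial\beta_a\,\partial\beta_b}\right|_{\boldsymbol\beta^{(t)}} = -\sum_{i} x_{ia}x_{ib}n_i\pi_i^{(t)}\left(1 - \pi_i^{(t)}\right).$$

Here, $\boldsymbol\pi^{(t)}$, approximation t for $\hat{\boldsymbol\pi}$, is obtained from $\boldsymbol\beta^{(t)}$ through

$$\pi_i^{(t)} = \frac{\exp\left(\sum_{j=0}^{p} \beta_j^{(t)} x_{ij}\right)}{1 + \exp\left(\sum_{j=0}^{p} \beta_j^{(t)} x_{ij}\right)}. \tag{5.22}$$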

We use $\mathbf{u}^{(t)}$ and $H^{(t)}$ with formula (4.45) to obtain the next value $\boldsymbol\beta^{(t+1)}$, which in this
context is

$$\boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} + \left\{X^T\text{Diag}\left[n_i\pi_i^{(t)}\left(1 - \pi_i^{(t)}\right)\right]X\right\}^{-1}X^T\left(\mathbf{y} - \boldsymbol\mu^{(t)}\right), \tag{5.23}$$

where $\mu_i^{(t)} = n_i\pi_i^{(t)}$. This is used to obtain $\boldsymbol\pi^{(t+1)}$, and so forth.

With an initial guess $\boldsymbol\beta^{(0)}$, (5.22) yields $\boldsymbol\pi^{(0)}$, and for t > 0 the iterations proceed as just
described using (5.23) and (5.22). In the limit, $\boldsymbol\pi^{(t)}$ and $\boldsymbol\beta^{(t)}$ converge to the ML estimates $\hat{\boldsymbol\pi}$
and $\hat{\boldsymbol\beta}$ (Walker and Duncan 1967). The $H^{(t)}$ matrices converge to $\hat{H} = -X^T\text{Diag}[n_i\hat\pi_i(1 -
\hat\pi_i)]X$. By (5.21) the estimated asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ is a by-product of the
Newton-Raphson method, namely $-\hat{H}^{-1}$.

From the argument in Section 4.6.4, $\boldsymbol\beta^{(t+1)}$ has the iterative reweighted least-squares
form $(X^T V_t^{-1} X)^{-1} X^T V_t^{-1}\mathbf{z}^{(t)}$, where $\mathbf{z}^{(t)}$ has elements

$$z_i^{(t)} = \log\frac{\pi_i^{(t)}}{1 - \pi_i^{(t)}} + \frac{y_i - n_i\pi_i^{(t)}}{n_i\pi_i^{(t)}\left(1 - \pi_i^{(t)}\right)}, \tag{5.24}$$

and where $V_t$ is a diagonal matrix with elements $\{1/[n_i\pi_i^{(t)}(1 - \pi_i^{(t)})]\}$. In this expression,
$\mathbf{z}^{(t)}$ is the linearized form of the logit link function for the sample data, evaluated at $\boldsymbol\pi^{(t)}$
[see (4.49)]. From Section 3.1.6 the elements of $V_t$ are estimated asymptotic variances of
the sample logits. The ML estimate is the limit of a sequence of weighted least-squares
estimates, where the weight matrix changes at each cycle.

The log likelihood is concave, so there is no danger of iterative methods converging to a
local maximum. However, in some cases at least one estimate may be infinite, as discussed
in Section 6.5.
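The iteration (5.22)-(5.23) is short enough to write out in full. The following is a minimal sketch, not the routine used by any particular software: it fits model (5.10) to the grouped AIDS data of Table 5.6, starting from $\boldsymbol\beta^{(0)} = \mathbf{0}$ (an assumed starting value), and inverts the small information matrix with a hand-written Gauss-Jordan helper so that only the standard library is needed. The function names and the 1.96 quantile are illustrative choices. The final lines apply the interval construction of Section 5.5.3 to white veterans with immediate AZT use.

```python
import math

# Grouped AIDS data (Table 5.6). Columns of X: intercept, azt, race (1 = white).
X = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
y = [14.0, 32.0, 11.0, 12.0]
n = [107.0, 113.0, 63.0, 55.0]
K = 3  # number of parameters


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def score_and_information(beta):
    """Score vector X^T(y - mu) and information X^T Diag[n pi (1 - pi)] X."""
    pi = [expit(sum(b * x for b, x in zip(beta, row))) for row in X]   # (5.22)
    u = [sum(row[j] * (yi - ni * p) for row, yi, ni, p in zip(X, y, n, pi))
         for j in range(K)]
    info = [[sum(row[a] * row[b] * ni * p * (1.0 - p)
                 for row, ni, p in zip(X, n, pi))
             for b in range(K)] for a in range(K)]
    return u, info


def newton_raphson(tol=1e-10, max_iter=25):
    beta = [0.0] * K                      # assumed initial guess
    for _ in range(max_iter):
        u, info = score_and_information(beta)
        cov = inverse(info)
        step = [sum(cov[a][b] * u[b] for b in range(K)) for a in range(K)]
        beta = [b + s for b, s in zip(beta, step)]                     # (5.23)
        if max(abs(s) for s in step) < tol:
            break
    _, info = score_and_information(beta)
    return beta, inverse(info)            # estimates and covariance (5.21)


beta_hat, cov_hat = newton_raphson()
se = [math.sqrt(cov_hat[j][j]) for j in range(K)]

# Wald interval for the probability at x = (1, 1, 1): white, immediate AZT use.
x_new = [1.0, 1.0, 1.0]
logit_hat = sum(b * x for b, x in zip(beta_hat, x_new))
var_logit = sum(x_new[a] * cov_hat[a][b] * x_new[b] for a in range(K) for b in range(K))
half_width = 1.96 * math.sqrt(var_logit)
ci = (expit(logit_hat - half_width), expit(logit_hat + half_width))

print([round(b, 4) for b in beta_hat], [round(s, 4) for s in se])
print(round(expit(logit_hat), 4), [round(v, 4) for v in ci])
```

The estimates and standard errors can be compared with the Analysis of Maximum Likelihood Estimates panel of Table 5.7, and the interval with the first row of its lower and upper columns.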


NOTES 

Section  5.1:  Interpreting  Parameters  in  Logistic  Regression 

5.1  Logistic  books:  Books  focusing  on  logistic  regression  include  Collett  (2003),  Cox  and  Snell 
(1989),  and  Hosmer  and  Lemeshow  (2000). 

5.2 Bias reduction: Haldane (1956) recommended adding 1/2 to each count in estimating a logit.
With this modification, the bias is on the order of only $1/n_i^2$, for large $n_i$. See also Firth (1993a),
Gart and Zweifel (1967), and Exercise 16.8. For bias reduction in logistic regression and GLMs,
see Cordeiro and McCullagh (1991) and Firth (1993a).

5.3 LD50: Paige et al. (2011) summarized confidence intervals for LD50 and proposed small-sample
intervals using saddlepoint approximations.

5.4 Retrospective logistic: For discussion of logistic regression with retrospective studies, see
Anderson (1972), Breslow (1996), Breslow and Day (1980, p. 203), Breslow and Powers (1978),
Carroll et al. (1995), Farewell (1979), Ghosh and Mukherjee (2010), Mantel (1973), Neuhaus
and Jewell (1990b), Piegorsch et al. (1994), Prentice (1976a), Prentice and Pyke (1979), Roeder
et al. (1996), and Umbach and Weinberg (1997). Scott and Wild (2001) considered case-control
studies with complex sampling designs, and Bhadra et al. (2012) incorporated longitudinal
information on exposure history. Qin and Liang (2011) considered a mixture model to handle
situations in which some controls are contaminated. See Section 7.2.3 for Bayesian literature.

5.5 Design: Khuri et al. (2006) reviewed articles about design problems for binary response
experiments. Issues include choosing settings for a predictor to optimize a criterion for estimating
parameter values, and estimating the setting at which the response probability equals some
fixed value. The nonconstant variance makes this challenging. Zocchi and Atkinson (1999)
considered multinomial logistic models.


Section  5.2:  Inference  for  Logistic  Regression 

5.6 Fitting/checking: Albert and Anderson (1984), Berkson (1944, 1951, 1953, 1955), Cox
(1958a), Hodges (1958), and Walker and Duncan (1967) discussed ML estimation for
logistic regression, although Berkson argued for the computationally simpler minimum logit
chi-squared. For adjustments with complex sample surveys, see Hosmer and Lemeshow (2000,
Sec. 6.4) and LaVange et al. (2001). Grouping values to check model fit extends to any GLM
(Pregibon 1982). Hosmer et al. (1997) compared various ways to do this. Presnell and Boos
(2004) proposed a general likelihood-based method for detecting model misspecification. See
also Capanu and Presnell (2008).


Section  5.3:  Logistic  Models  with  Categorical  Predictors 

5.7 Trend tests: Extensions of the trend test include handling of correlated binary data by Corcoran
et al. (2001) and stratified $I \times J$ tables by Mantel (1963). Williams (2005) surveyed trend tests
for proportions and counts.


Section  5.4:  Multiple  Logistic  Regression 

5.8 Standardizing: Menard (2004) discussed several approaches to standardizing logistic regression
coefficients. He noted that merely standardizing predictors, as was done in Section 5.4.7,
is adequate for comparing influences of predictors.

5.9 Quasi-variances: For multipredictor models such as (5.12), tables that contain factor-level
estimates $\{\hat{\beta}_k\}$ and their SE values but not their covariance matrix permit comparison of
each category to the baseline (having estimate 0) but not to other categories. Firth and De
Menezes (2004) showed how to construct quasi-variances $\{q_k\}$ such that the SE of $\hat{\beta}_k - \hat{\beta}_l$ is
approximately $\sqrt{q_k + q_l}$.


EXERCISES 

Applications 

5.1 An article about the contributions of star players in the National Basketball Association
(by M. L. Jones and R. J. Parker, Chance, 23, 39-45, 2010) reported prediction
equations for the probability $\pi$ of a win in a game for a player, using as predictors
ortg = player's offensive rating in the game, which is the number of points produced
per hundred possessions, drtg = player's defensive rating in the game, which is the
number of points allowed per hundred possessions (the lower the better), and home,
which indicates whether the game was played at home (1 = yes, 0 = no). For LeBron
James using data from the 2008-2009 season,

$\text{logit}(\hat{\pi}) = 1.379 + 0.119(\textit{ortg}) - 0.139(\textit{drtg}) + 3.393(\textit{home})$.

a. Over the season, James's quartiles (lower, median, upper) were (108.7, 123.2,
136.1) for ortg and (91.7, 99.5, 107.7) for drtg. Summarize the ortg effect for
James by comparing $\hat{\pi}$ at its upper and lower quartiles. Do this at the median
level of drtg, separately for home and away games. Repeat for the drtg effect,
and compare.

b. Summarize the home effect by (i) comparing $\hat{\pi}$ for home and away games, at the
median levels of ortg and drtg, (ii) interpreting its coefficient in the fitted logistic
equation.

5.2 For a study using logistic regression to determine characteristics associated with
remission in cancer patients, Table 5.11 shows the most important explanatory
variable, a labeling index (LI) that measures proliferative activity of cells after
a patient receives an injection of tritiated thymidine. It represents the percentage
of cells that are "labeled." The response measured whether the patient achieved
remission. Software reports Table 5.12 for a logistic regression model using LI to
estimate $\pi = P(\text{remission})$.

a. Show how software obtained $\hat{\pi} = 0.068$ when LI = 8.

b. Show that $\hat{\pi} = 0.50$ when LI = 26.0.

c. Show that the rate of change in $\hat{\pi}$ is 0.009 when LI = 8 and 0.036 when LI = 26.

d. The lower quartile and upper quartile for LI are 14 and 28. Show that $\hat{\pi}$ increases
by 0.42, from 0.15 to 0.57, between those values.

e. For a unit increase in LI, show that the estimated odds of remission multiply by
1.16.

f. Explain how to obtain the confidence interval reported for the odds ratio. Interpret.

g. Construct a Wald test for the effect. Interpret.


Table 5.11  Data for Exercise 5.2 on Cancer Remission

      Number    Number of           Number    Number of           Number    Number of
LI    of Cases  Remissions    LI    of Cases  Remissions    LI    of Cases  Remissions
 8       2          0         18       1          1         28       1          1
10       2          0         20       3          2         32       1          0
12       3          0         22       2          1         34       1          1
14       3          0         24       1          0         38       3          2
16       3          0         26       1          1

Source: Data reprinted with permission from E. T. Lee, Comput. Prog. Biomed. 4: 80-92, 1974.


Table 5.12  Software Output (Based on SAS) for Exercise 5.2

              Intercept    Intercept and
Criterion     Only         Covariates
-2 Log L      34.372       26.073

                          Standard
Parameter    Estimate     Error      Chi-Square    Pr > ChiSq
Intercept    -3.7771      1.3786     7.5064        0.0061
li            0.1449      0.0593     5.9594        0.0146

            Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
li        1.156             1.029     1.298

      Estimated Covariance Matrix
Variable     Intercept     li
Intercept    1.900616      -0.07653
li           -0.07653      0.003521

Obs    li    remiss    n    pi.hat     lower      upper
1       8    0         2    0.06797    0.01121    0.31925
2      10    0         2    0.08879    0.01809    0.34010

h. Conduct a likelihood-ratio test for the effect, showing how to construct the test
statistic using the -2 log L values reported.

i. Show how software obtained the confidence interval for $\pi$ reported at LI = 8.
[Hint: Use the reported covariance matrix.]
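
The quantities requested in Exercise 5.2 are simple functions of the numbers printed in Table 5.12. As a hedged sketch of that arithmetic (the variable names are ours, and 1.96 is used as the approximate normal quantile), one could compute:

```python
import math

# Estimates and covariance entries copied from Table 5.12.
alpha_hat, beta_hat = -3.7771, 0.1449
se_beta = 0.0593
var_alpha, var_beta, cov_ab = 1.900616, 0.003521, -0.07653
z = 1.96  # approximate 0.975 quantile of the standard normal

def expit(t):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-t))

li = 8
logit_8 = alpha_hat + beta_hat * li
pi_8 = expit(logit_8)                      # fitted probability at LI = 8
slope_8 = beta_hat * pi_8 * (1.0 - pi_8)   # rate of change at LI = 8
li_50 = -alpha_hat / beta_hat              # LI at which the fitted probability is 0.50

odds_ratio = math.exp(beta_hat)
or_ci = (math.exp(beta_hat - z * se_beta), math.exp(beta_hat + z * se_beta))

# Wald statistic from the rounded estimates; it differs slightly from the
# printed 5.9594, presumably because the printed estimates are rounded.
wald_chisq = (beta_hat / se_beta) ** 2

# Likelihood-ratio statistic: difference of the two -2 log L values.
lr_stat = 34.372 - 26.073

# Interval for pi at LI = 8: build it on the logit scale, then transform.
se_logit_8 = math.sqrt(var_alpha + li ** 2 * var_beta + 2 * li * cov_ab)
pi_ci_8 = (expit(logit_8 - z * se_logit_8), expit(logit_8 + z * se_logit_8))

print(pi_8, slope_8, li_50, odds_ratio, or_ci, wald_chisq, lr_stat, pi_ci_8)
```

Forming the interval on the logit scale and then transforming keeps both endpoints between 0 and 1, which a symmetric interval built directly on the probability scale would not guarantee.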

5.3 The text website has a data file (created from data at www.basketball-reference.com)
showing, for each game in the 2010-2011 season of the National
Basketball Association in which Rajon Rondo of the Boston Celtics played,
x = the number of assists he recorded and y = whether the Celtics won (1 =
yes). Using software, (a) show that the logistic model fitted to these data gives
$\text{logit}[\hat{P}(Y = 1)] = -2.235 + 0.294x$; (b) show that $\hat{P}(Y = 1)$ increases from 0.21
to 0.99 over the observed range of x from 3 to 24; and (c) construct a significance
test and confidence interval about the effect in the conceptual population that these
games represent.

5.4 Table 5.13 summarizes logistic regression results from a study¹ of how family
transitions relate to first home purchase by young married households. The response
variable is whether the subject owns a home (1 = yes, 0 = no).

a. Interpret the effects that seem to be significant.

b. Fill in the blanks: Adjusting for the other explanatory variables, each additional
child had the effect of multiplying the estimated odds of owning a home by ____;
that is, the estimated odds increase by ____%. A $10,000 increase in earnings
had the effect of multiplying the estimated odds of owning a home by ____ if
the earnings add to husband's income and by ____ for wife's income.


Table 5.13  Results of Logistic Regression for Probability of
Home Ownership

Variable                              Estimate    Std. Error
Intercept                             -2.870      -
Husband earnings ($10,000)             0.569      0.088
Wife earnings ($10,000)                0.306      0.140
Number of years married               -0.039      0.042
Married in 2 years (1 = yes)           0.224      0.304
Working wife in 2 years (1 = yes)      0.373      0.283
Number of children                     0.220      0.101
Add child in 2 years (1 = yes)         0.271      0.140
Head's education (no. years)          -0.027      0.032
Parents' home ownership (1 = yes)      0.387      0.176
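
One way to screen Table 5.13 for effects that "seem to be significant" is to form the Wald ratio of each estimate to its standard error and to exponentiate each estimate into a multiplicative effect on the odds. The sketch below does this; the short variable labels are our own shorthand for the table rows, and the cutoff of 1.96 is the usual two-sided 0.05 criterion, assumed here only for illustration.

```python
import math

# (label, estimate, standard error) copied from Table 5.13; labels are shorthand.
rows = [
    ("husband_earnings", 0.569, 0.088),
    ("wife_earnings", 0.306, 0.140),
    ("years_married", -0.039, 0.042),
    ("married_in_2_years", 0.224, 0.304),
    ("working_wife_in_2_years", 0.373, 0.283),
    ("number_of_children", 0.220, 0.101),
    ("add_child_in_2_years", 0.271, 0.140),
    ("head_education", -0.027, 0.032),
    ("parents_home_ownership", 0.387, 0.176),
]

summary = {}
for label, estimate, se in rows:
    wald_z = estimate / se            # Wald statistic for H0: effect = 0
    odds_mult = math.exp(estimate)    # multiplicative effect on the odds
    summary[label] = (wald_z, odds_mult)
    print(f"{label:26s} z = {wald_z:6.2f}   odds multiplier = {odds_mult:.3f}")

# Effects whose Wald ratio exceeds 1.96 in absolute value.
significant = [label for label, (wald_z, _) in summary.items() if abs(wald_z) > 1.96]
print(significant)
```

The odds multipliers printed for the earnings and children rows are the quantities that part (b) of Exercise 5.4 asks for.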

¹From J. Henretta, Social Forces 66: 520-536, 1987.

5.5 Consider the fit of model (5.2) for the horseshoe crabs using x = width.

a. Show that (i) at the mean width (26.3), the estimated odds of a satellite equal
2.07; (ii) at x = 27.3, the estimated odds equal 3.40; and (iii) since $\exp(\hat{\beta}) = 1.64$,
3.40 = (1.64)2.07, and the odds increase by 64%.

b. Based on the 95% confidence interval for $\beta$, show that for x near where $\hat{\pi} = 0.50$,
the rate of increase in the probability of a satellite per 1-cm increase in x falls
between about 0.07 and 0.17.

5.6  For  the  23  space  shuttle  flights  before  the  Challenger  mission  disaster  in  1986, 
Table  5.14  shows  the  temperature  at  the  time  of  the  flight  and  whether  at  least  one 
primary  O-ring  suffered  thermal  distress. 

a.  Use  logistic  regression  to  model  the  effect  of  temperature  on  the  probability  of 
thermal  distress.  Plot  a  figure  of  the  fitted  model,  and  interpret. 

b.  Estimate  the  probability  of  thermal  distress  at  31°F,  the  temperature  at  the  place 
and  time  of  the  Challenger  flight. 

c.  Construct  a  confidence  interval  for  the  effect  of  temperature  on  the  odds  of  thermal 
distress,  and  test  the  statistical  significance  of  the  effect. 


Table 5.14  Data for Exercise 5.6 on Challenger Space-Shuttle Disaster^a

Ft  Temp  TD     Ft  Temp  TD     Ft  Temp  TD     Ft  Temp  TD     Ft  Temp  TD
 1   66    0      2   70    1      3   69    0      4   68    0      5   67    0
 6   72    0      7   73    0      8   70    0      9   57    1     10   63    1
11   70    1     12   78    0     13   67    0     14   53    1     15   67    0
16   75    0     17   70    0     18   81    0     19   76    0     20   79    0
21   75    1     22   76    0     23   58    1

^a Ft, flight number; Temp, temperature (°F); TD, thermal distress (1, yes; 0, no).

Source: Data based on Table 1 in J. Am. Statist. Assoc. 84: 945-957, 1989, by S. R. Dalal, E. B. Fowlkes, and
B. Hoadley. Reprinted with permission from J. Am. Statist. Assoc.


5.7  Refer  to  Table  4.2.  Using  scores  (0,  2,  4,  5)  for  snoring,  fit  the  logistic  regression 
model.  Interpret  using  fitted  probabilities,  linear  approximations,  and  effects  on  the 
odds.  Analyze  the  goodness  of  fit. 

5.8 Hastie and Tibshirani (1990, p. 282) described a study to determine risk factors for
kyphosis, severe forward flexion of the spine following corrective spinal surgery.
The age in months at the time of the operation for the 18 subjects for whom kyphosis
was present were 12, 15, 42, 52, 59, 73, 82, 91, 96, 105, 114, 120, 121, 128, 130,
139, 139, 157 and for 22 of the subjects for whom kyphosis was absent were 1, 1, 2,
8, 11, 18, 22, 31, 37, 61, 72, 81, 97, 112, 118, 127, 131, 140, 151, 159, 177, 206.

a. Fit a logistic regression model using age as a predictor of whether kyphosis is
present. Test whether age has a significant effect.

b. Plot the data. Note the difference in dispersion on age at the two levels of kyphosis.
Fit the model $\text{logit}[\pi(x)] = \alpha + \beta_1 x + \beta_2 x^2$. Test the significance of the squared
age term, plot the fit, and interpret. (See also Exercise 5.30 and Section 7.4.3.)

5.9 For Table 5.5 on treating leprosy, the Pearson test of independence has $X^2(I) =
6.88$ (P = 0.14). For equally spaced scores, the Cochran-Armitage trend test has
$z^2 = 6.67$ (P = 0.01). Interpret, and explain why the P-values differ so. Analyze
the data using a linear logit model. Test independence using the Wald and
likelihood-ratio tests, and compare results to the Cochran-Armitage test. Check the
fit of the model, and interpret.

5.10  Refer  to  Table  5.3  on  infant  malformation  and  alcohol  consumption. 

a.  Repeat  the  trend  test  of  Section  5.3.5  after  deleting  the  single  case  in  the  last  row. 
Comment  on  that  observation’s  influence. 

b. Repeat the trend test using alcohol consumption scores (1, 2, 3, 4, 5) instead of
(0.0, 0.5, 1.5, 4.0, 7.0). Compare results, noting the potential sensitivity to the
choice of scores for highly unbalanced data.


5.11  A  study  used  the  1998  Behavioral  Risk  Factors  Social  Survey  to  consider  factors 
associated  with  women’s  use  of  oral  contraceptives  in  the  United  States.  Table  5.15 
summarizes  effects  for  a  logistic  regression  model  for  the  probability  of  using  oral 
contraceptives.  Each  predictor  uses  an  indicator  variable,  and  the  table  lists  the 
category  having  indicator  outcome  1.  Interpret  effects.  Construct  and  interpret  a 
confidence  interval  for  the  conditional  odds  ratio  between  contraceptive  use  and 
education. 

Table 5.15  Data for Exercise 5.11 on Oral Contraceptive Use

Variable    Coding = 1 if:    Estimate

Source: Data courtesy of Debbie Wilson, College of Pharmacy, University of Florida.


5.12 For the horseshoe crab data, available at www.stat.ufl.edu/~aa/cda/cda.html,
fit a logistic regression model for the probability of a satellite,
using color alone as the predictor.

a.  Treat  color  as  nominal.  Explain  why  this  model  is  saturated.  Express  its  parameter 
estimates  in  terms  of  the  sample  logits  for  each  color. 

b.  Conduct  a  likelihood-ratio  test  that  color  has  no  effect. 

c.  Fit  a  model  that  treats  color  as  quantitative.  Interpret  the  fit,  and  test  that  color 
has  no  effect. 

d.  Test  the  goodness  of  fit  of  the  model  in  part  (c).  Interpret. 

5.13  For  model  (5.15)  with  binary  color  c  and  width  x,  (a)  describe  the  effect  of  width 
by  finding  the  estimated  probabilities  of  a  satellite  at  its  lower  and  upper  quartiles, 
separately  for  c  =  1  and  c  =  0,  and  (b)  describe  the  effect  of  color  by  its  average 
causal  effect. 

5.14 Refer to the prediction equation $\text{logit}(\hat{\pi}) = -10.071 - 0.509c + 0.458x$ for model
(5.14) using quantitative color and width. The means and standard deviations are $\bar{c} =
2.44$ and s = 0.80 for color, and $\bar{x} = 26.30$ and s = 2.11 for width. For standardized
predictors [e.g., x = (width - 26.30)/2.11], explain why the estimated coefficients
of c and x equal -0.41 and 0.97. Interpret these by comparing the partial effects
of a standard deviation increase in each predictor on the odds. Describe the color
effect by estimating the change in $\hat{\pi}$ between the first and last color categories at the
sample mean width.

5.15 For Table 2.6, we fitted a logistic model, treating death penalty as the response (1 =
yes) and defendant's race (1 = white) and victims' race (1 = white) as indicator
predictors. Table 5.16 shows results.

a.  Interpret  parameter  estimates.  Which  group  is  most  likely  to  have  the  yes  re¬ 
sponse?  Find  the  estimated  probability  in  that  case. 

b.  Interpret  95%  confidence  intervals  for  conditional  odds  ratios. 

c.  Test  the  effect  of  defendant’s  race,  controlling  for  victims’  race,  using  a  (i)  Wald 
test  and  (ii)  likelihood-ratio  test.  Interpret. 

d.  Test  the  goodness  of  fit  of  the  model.  Interpret. 

Table 5.16  Software Output (Based on SAS) for Exercise 5.15 on the Death Penalty

      Criteria For Assessing Goodness Of Fit
Criterion             DF    Value
Deviance               1    0.3798
Pearson Chi-Square     1    0.1978
Log Likelihood              -209.4783

                          Standard    Likelihood Ratio
Parameter    Estimate     Error       95% Conf Limits       Chi-Square
Intercept    -3.5961      0.5069      -4.7754   -2.7349     50.33
def          -0.8678      0.3671      -1.5633   -0.1140      5.59
vic           2.4044      0.6006       1.3068    3.7175     16.03

         LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
def        1     5.01         0.0251
vic        1    20.35         <.0001

5.16  Model  the  effects  of  victim’s  race  and  defendant’s  race  for  Table  2.12.  Interpret. 

5.17 In a 2011 article in North Carolina Law Review, M. Radelet and G. Pierce reported
a logistic prediction equation for death penalty verdicts in North Carolina. Let Y
denote whether a subject convicted of murder received the death penalty (1 = yes),
for defendant's race h (h = 1, black; h = 2, white), victim's race i (i = 1, black;
i = 2, white), and number of additional factors j (j = 0, 1, 2). For the model

$\text{logit}[P(Y = 1)] = \alpha + \beta_h^D + \beta_i^V + \beta_j^F$,

they reported $\hat{\alpha} = -5.26$, $\hat{\beta}_1^D = 0.00$, $\hat{\beta}_2^D = 0.17$, $\hat{\beta}_1^V = 0.00$, $\hat{\beta}_2^V = 0.91$, $\hat{\beta}_0^F = 0.00$,
$\hat{\beta}_1^F = 2.02$, $\hat{\beta}_2^F = 3.98$.

a. Estimate the probability of receiving the death penalty for the group most likely
to receive it.

b. If, instead, parameters used constraints $\beta_2^D = \beta_2^V = \beta_2^F = 0$, report the estimates.

c. If, instead, parameters used constraints $\sum_h \beta_h^D = \sum_i \beta_i^V = \sum_j \beta_j^F = 0$, report
the estimates.

5.18 In a study designed to evaluate whether an educational program makes sexually active
adolescents  more  likely  to  obtain  condoms,  adolescents  were  randomly  assigned  to 
two  experimental  groups.  The  educational  program,  involving  a  lecture  and  videotape 
about  transmission  of  HIV,  was  provided  to  one  group  but  not  the  other.  Table  5.17 
summarizes  results  of  a  logistic  regression  model  for  factors  observed  to  influence 
teenagers  to  obtain  condoms. 

a.  Find  the  parameter  estimates  for  the  fitted  model,  using  (1,0)  indicator  variables 
for  the  first  three  predictors.  Based  on  the  corresponding  confidence  interval  for 
the  log  odds  ratio,  determine  the  standard  error  for  the  group  effect. 

b. Explain why either the estimate of 1.38 for the odds ratio for gender or the
corresponding confidence interval is incorrect. Show that if the reported interval
is correct, 1.38 is actually the log odds ratio, and the estimated odds ratio equals
3.98.


Table 5.17  Data for Exercise 5.18 on Obtaining Condoms

                                              95% Confidence
Variable                       Odds Ratio     Interval
Group (education vs. none)     4.04           (1.17, 13.9)
Gender (males vs. females)     1.38           (1.23, 12.88)
SES (high vs. low)             5.82           (1.87, 18.28)
Lifetime number of partners    3.22           (1.08, 11.31)

Source: V. I. Rickert et al., Clin. Pediatr. 31: 205-210, 1992.
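
A Wald interval for an odds ratio is symmetric about the estimate on the log scale, which gives a quick consistency check on a reported row of Table 5.17. The following sketch applies that check to the group and gender rows; it assumes the intervals were formed as exp(estimate ± 1.96 SE), which the table does not state explicitly.

```python
import math

z = 1.96  # assumed normal quantile behind the reported 95% intervals

# Group row of Table 5.17: odds ratio 4.04, interval (1.17, 13.9).
lo_group, hi_group = 1.17, 13.9
se_group = (math.log(hi_group) - math.log(lo_group)) / (2 * z)
center_group = math.exp((math.log(lo_group) + math.log(hi_group)) / 2)

# Gender row: reported "odds ratio" 1.38, interval (1.23, 12.88).
lo_gender, hi_gender = 1.23, 12.88
center_gender = math.sqrt(lo_gender * hi_gender)  # geometric mean of the limits

print(se_group, center_group)          # center should be close to 4.04
print(center_gender, math.exp(1.38))   # both should be close to 3.98
```

For the group row the geometric mean of the limits nearly reproduces the reported odds ratio, whereas for the gender row it does not, which is the inconsistency that part (b) of Exercise 5.18 asks about.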


5.19 Table 5.18 shows estimated effects for a logistic regression model with squamous
cell esophageal cancer (Y = 1, yes; Y = 0, no) as the response. Smoking status
(S) equals 1 for at least one pack per day and 0 otherwise, alcohol consumption
(A) equals the average number of alcoholic drinks consumed per day, and race (R)
equals 1 for blacks and 0 for whites. To describe the R × S interaction, construct
the prediction equation when R = 1 and again when R = 0. Find the fitted YS
conditional odds ratio for each case. Similarly, construct the prediction equation
when S = 1 and again when S = 0. Find the fitted YR conditional odds ratios. Note
that for each association, the coefficient of R × S is the difference between the log
odds ratios at the two fixed levels for the other variable. Explain why the coefficient
of S represents the log odds ratio between Y and S for whites. To what hypotheses
do the P-values for R and S refer?

Table 5.18  Data for Exercise 5.19 on Esophageal Cancer

Variable                    Effect    P-value
Intercept                   -7.00     <0.01
Alcohol use (A)              0.10      0.03
Smoking (S)                  1.20     <0.01
Race (R)                     0.30      0.02
Race × smoking (R × S)       0.20      0.04

5.20 A survey of high school students on Y = whether the subject has driven a motor
vehicle after consuming a substantial amount of alcohol (1 = yes), s = sex (1 =
female), r = race (1 = black; 0 = white), and g = grade ($g_1 = 1$, grade 9; $g_2 = 1$,
grade 10; $g_3 = 1$, grade 11; $g_1 = g_2 = g_3 = 0$, grade 12) has prediction equation

$\text{logit}[\hat{P}(Y = 1)] = -0.88 - 0.40s - 0.72r - 2.22g_1 - 1.43g_2 - 0.58g_3
+ 0.74(r \times g_1) + 0.38(r \times g_2) + 0.01(r \times g_3)$.

Carefully interpret effects. Explain the interaction by describing the race effect at
each grade and the grade effect for each race.

5.21  The  Gallup  Poll  reported  in  March  2010  that  the  percentage  believing  that  news 
reports  exaggerate  the  seriousness  of  global  warming  is  66%  for  Republicans  and 
22%  for  Democrats.  By  contrast,  in  1998  the  corresponding  percentages  were  34% 
and  23%.  Considered  as  results  for  a  three-way  table  cross-classifying  opinion  by 
political  party  and  year,  do  these  data  seem  to  display  interaction?  In  what  sense? 

5.22  A  table  at  the  text  website  refers  to  a  sample  of  subjects  randomly  selected  for  an 
Italian  study  on  the  relation  between  income  and  whether  one  possesses  a  travel 
credit  card.  At  each  level  of  annual  income  in  millions  of  lira  (the  Italian  currency 
at  the  time  of  the  study),  the  table  indicates  the  number  of  subjects  sampled  and  the 
number  possessing  at  least  one  travel  credit  card.  Analyze  these  data. 

5.23 A research article in the British Medical Journal (by C. de Oliveira et al., 2010, vol.
340) showed results from the Scottish Health Survey, indicating that over a period
of about 8 years, cardiovascular disease events occurred for 308 of 8481 subjects
who reported brushing their teeth at least twice a day, for 188 of 2850 subjects who
reported brushing once a day, and for 59 of 538 subjects who reported brushing less
than once a day. Analyze these data.

5.24 Are people with more social ties less likely to get colds? Use logistic models to
analyze the 2 × 2 × 2 × 2 contingency table on p. 1943 of the article by S. Cohen
et al., J. Am. Med. Assoc. 277 (24).

Theory  and  Methods 

5.25 For logistic regression model (5.1), show that $\partial\pi(x)/\partial x = \beta\pi(x)[1 - \pi(x)]$.

5.26 For logistic model (5.1), when $\pi(x)$ is small, explain why you can interpret $\exp(\beta)$
approximately as $\pi(x + 1)/\pi(x)$.
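
Before attempting the proofs, the two properties above can be checked numerically. The sketch below assumes that model (5.1) is the single-predictor logistic model with linear predictor alpha + beta x; the parameter values are hypothetical and chosen only so that the curve crosses 0.50 at x = 6.

```python
import math

def pi(x, alpha, beta):
    """Success probability under logit[pi(x)] = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

alpha, beta = -3.0, 0.5   # hypothetical values, for illustration only
h = 1e-6                  # step for the central finite difference

# Compare a numerical derivative with beta * pi(x) * [1 - pi(x)].
max_gap = 0.0
for x in [-10.0, 0.0, 4.0, 6.0, 8.0, 15.0]:
    p = pi(x, alpha, beta)
    finite_diff = (pi(x + h, alpha, beta) - pi(x - h, alpha, beta)) / (2 * h)
    analytic = beta * p * (1.0 - p)
    max_gap = max(max_gap, abs(finite_diff - analytic))
print(max_gap)

# Where pi(x) is small, the ratio pi(x + 1) / pi(x) is close to exp(beta).
x_small = -10.0
ratio_small = pi(x_small + 1, alpha, beta) / pi(x_small, alpha, beta)
print(ratio_small, math.exp(beta))
```

The agreement in the second comparison deteriorates as pi(x) grows, because the odds and the probability are close only when the probability is near 0.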

5.27 Prove that the logistic regression curve (5.1) has the steepest slope where $\pi(x) = \frac{1}{2}$.
Generalize to model (5.8).

5.28 The calibration problem is that of estimating x at which $\pi(x) = \pi_0$ for some fixed
$\pi_0$ such as 0.50. For the linear logit model, argue that a confidence interval is the set
of x values for which

$|\hat{\alpha} + \hat{\beta}x - \text{logit}(\pi_0)| / [\widehat{\text{var}}(\hat{\alpha}) + x^2\, \widehat{\text{var}}(\hat{\beta}) + 2x\, \widehat{\text{cov}}(\hat{\alpha}, \hat{\beta})]^{1/2} < z_{\alpha/2}$.

An alternative approach inverts a likelihood-ratio test.
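
Squaring both sides of the inequality in Exercise 5.28 turns the confidence set into the solution of a quadratic inequality in x. The sketch below solves it in the case where the set is a bounded interval; the function name is hypothetical, and the estimates and covariance entries are borrowed from Table 5.12 purely as an illustration.

```python
import math

def calibration_interval(alpha, beta, var_a, var_b, cov_ab, pi0=0.5, z=1.96):
    """Set of x with |alpha + beta*x - logit(pi0)| / SE(x) < z, if it is an interval.

    Squaring gives A*x**2 + B*x + C < 0. The solution is a bounded interval only
    when A > 0 and the discriminant is positive; otherwise None is returned.
    """
    target = math.log(pi0 / (1.0 - pi0))
    a = alpha - target
    A = beta ** 2 - z ** 2 * var_b
    B = 2.0 * (a * beta - z ** 2 * cov_ab)
    C = a ** 2 - z ** 2 * var_a
    disc = B ** 2 - 4.0 * A * C
    if A <= 0 or disc <= 0:
        return None
    root = math.sqrt(disc)
    return ((-B - root) / (2.0 * A), (-B + root) / (2.0 * A))

# Illustration with the values printed in Table 5.12 (remission data).
alpha_hat, beta_hat = -3.7771, 0.1449
interval = calibration_interval(alpha_hat, beta_hat, 1.900616, 0.003521, -0.07653)
point_estimate = -alpha_hat / beta_hat   # x at which the fitted probability is 0.50
print(point_estimate, interval)
```

The condition A > 0 says that the slope estimate is significantly different from 0 at the chosen level; when it fails, the confidence set is unbounded, which is why the sketch declines to return an interval in that case.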

5.29 A study for several professional sports of the effect of a player's draft position
d (d = 1, 2, 3, ...) of selection from the pool of potential players in a given year on
the probability $\pi$ of eventually being named an all star used the model $\text{logit}(\pi) =
\alpha + \beta \log d$ (S. M. Berry, Chance, 14(2): 53-57, 2001).

a. Show that $\pi/(1 - \pi) = e^{\alpha} d^{\beta}$. Show that $e^{\alpha}$ = odds for the first draft pick.

b. In the United States, Berry reported $\hat{\alpha} = 2.3$ and $\hat{\beta} = -1.1$ for pro basketball
and $\hat{\alpha} = 0.7$ and $\hat{\beta} = -0.6$ for pro baseball. This suggests that in basketball a
first draft pick is more crucial and picks with high d are relatively less likely to
be all-stars. Explain why.

5.30 For the population having Y = j, suppose X has a $N(\mu_j, \sigma^2)$ distribution, j = 0, 1.

a. Using Bayes' theorem, show that $P(Y = 1 \mid x)$ satisfies the logistic regression
model with $\beta = (\mu_1 - \mu_0)/\sigma^2$.

b. Suppose that $(X \mid Y = j)$ is $N(\mu_j, \sigma_j^2)$ with $\sigma_0 \neq \sigma_1$. Show that the logistic
model holds with a quadratic term (Anderson 1975). [Exercise 5.8 showed that a
quadratic term is helpful when x values have quite different dispersion at y = 0
and y = 1. This result also suggests that to test equality of means of normal
distributions when the variances differ, we can fit a quadratic logistic regression with
the two groups as the response and test the linear and quadratic terms together;
see O'Brien (1988).]

c. Suppose that $(X \mid Y = j)$ has an exponential family density $f(x \mid \theta_j) =
a(\theta_j) h(x) \exp[x Q(\theta_j)]$. Show that $P(Y = 1 \mid x)$ satisfies the logistic model, with
effect of x equal to $[Q(\theta_1) - Q(\theta_0)]$.

d. For multiple predictors, suppose that $(\mathbf{X} \mid Y = j)$ has a multivariate $N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma})$
distribution, j = 0, 1. Show that $P(Y = 1 \mid \mathbf{x})$ satisfies logistic regression with
effect parameters $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ (Cornfield 1962, Warner 1963).

5.31 Suppose that $\pi(x) = F(x)$ for some strictly increasing cdf F. Explain why a monotone
transformation of x exists such that the logistic regression model holds. Generalize
to alternative link functions.

5.32 For an $I \times 2$ contingency table, consider logistic model (5.4).

a. Given $\{\pi_i > 0\}$, show how to find $\{\beta_i\}$ satisfying $\beta_I = 0$.

b. Prove that $\beta_1 = \beta_2 = \cdots = \beta_I$ is the independence model. Find its likelihood
equation, and show that $\hat{\alpha} = \text{logit}[(\sum_i y_i)/(\sum_i n_i)]$.

5.33 For a multinomial distribution, let $\gamma = \sum_i b_i \pi_i$, and suppose that $\pi_i = f_i(\theta) > 0$,
$i = 1, \ldots, I$. For sample proportions $\{p_i\}$, let $S = \sum_i b_i p_i$. Let $T = \sum_i b_i \hat{\pi}_i$, where
$\hat{\pi}_i = f_i(\hat{\theta})$, for the ML estimator $\hat{\theta}$ of $\theta$.

a. Show that $\text{var}(S) = [\sum_i b_i^2 \pi_i - (\sum_i b_i \pi_i)^2]/n$.

b. Using the delta method, show $\text{var}(T) \approx [\text{var}(\hat{\theta})][\sum_i b_i f_i'(\theta)]^2$.

c. By computing the information for $L(\theta) = \sum_i n_i \log[f_i(\theta)]$, show that $\text{var}(\hat{\theta})$ is
approximately $[n \sum_i (f_i'(\theta))^2 / f_i(\theta)]^{-1}$.

d. Asymptotically, show that $\text{var}[\sqrt{n}(T - \gamma)] \leq \text{var}[\sqrt{n}(S - \gamma)]$. [Hint: Show
that var(T)/var(S) is a squared correlation between two random variables,
where with probability $\pi_i$ the first equals $b_i$ and the second equals
$f_i'(\theta)/f_i(\theta)$.]

5.34 Construct the log-likelihood function for the model $\text{logit}[\pi(x)] = \alpha + \beta x$ with
independent binomial outcomes of $y_0$ successes in $n_0$ trials at x = 0 and $y_1$ successes
in $n_1$ trials at x = 1. Derive the likelihood equations, and show that $\hat{\beta}$ is the sample
log odds ratio.

5.35 A study has $n_i$ independent binary observations $\{y_{i1}, \ldots, y_{in_i}\}$ when $X = x_i$,
$i = 1, \ldots, N$, with $n = \sum_i n_i$. Consider the model $\text{logit}(\pi_i) = \alpha + \beta x_i$, where
$\pi_i = P(Y_{ij} = 1)$.

a. Show that the kernel of the likelihood function is the same treating the data as n
Bernoulli observations or N binomial observations.

b. For the saturated model, explain why the likelihood function is different for these
two data forms. [Hint: The number of parameters differs.] Hence, the deviance
reported by software depends on the form of data entry.

c. Explain why the difference between deviances for two unsaturated models does
not depend on the form of data entry.

d. Suppose that each $n_i = 1$. Show that the deviance depends on $\hat{\pi}_i$ but not $y_i$.
Hence, it is not useful for checking model fit (see also Exercise 4.18).

5.36 Suppose that Y has a bin(n, $\pi$) distribution. For the model, $\text{logit}(\pi) = \alpha$, consider
testing $H_0$: $\alpha = 0$ (i.e., $\pi = 0.50$). Let $\hat{\pi} = y/n$.

a. Compare the estimated SE for the Wald test and the SE using the null value 0.50
for $\pi$, for two possible denominators in the test statistic $[\text{logit}(\hat{\pi})/SE]^2$. Show
that the ratio of the Wald statistic to the statistic with null SE equals $4\hat{\pi}(1 - \hat{\pi})$.
What is the implication about performance of the Wald test if $|\alpha|$ is large and $\hat{\pi}$
tends to be near 0 or 1?

b. How does the comparison of tests change with the scale $[(\hat{\pi} - 0.5)/SE]^2$, where
SE is now the estimated or null SE of $\hat{\pi}$? [Analogous results apply for inference
about the Poisson mean versus the log mean; see also Mantel (1987a) and
Section 5.2.6.]

5.37 Find the likelihood equations for model (5.10) with two binary predictors. Show that
they imply that the fitted values and the sample counts are identical in the marginal
two-way tables.

5.38 Consider the likelihood equations (5.18) for a logistic regression model. Using
the equation resulting from the intercept parameter, show that the overall sample
proportion of successes equals the sample mean of the fitted success probabilities.

5.39 Consider the linear logit model (5.5) for an $I \times 2$ table, with $y_i$ a bin($n_i$, $\pi_i$) variate.

a. Show that the log likelihood is

$L(\boldsymbol{\beta}) = \sum_{i=1}^{I} y_i(\alpha + \beta x_i) - \sum_{i=1}^{I} n_i \log[1 + \exp(\alpha + \beta x_i)]$.

b. Show that the sufficient statistic for $\beta$ is $\sum_i x_i y_i$, and explain why this is
essentially the variable utilized in the Cochran-Armitage test. (That test is a score test
of $H_0$: $\beta = 0$.)

c. Letting $S = \sum_i y_i$, show that the likelihood equations are

$S = \sum_i n_i \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}$,

$\sum_i x_i y_i = \sum_i n_i x_i \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}$.

d. Let $\{\hat{\mu}_i = n_i \hat{\pi}_i\}$. Explain why $\sum_i \hat{\mu}_i = \sum_i y_i$ and

$\sum_i x_i \left( \frac{\hat{\mu}_i}{\sum_a \hat{\mu}_a} \right) = \sum_i x_i \left( \frac{y_i}{\sum_a y_a} \right)$.

Explain why this implies that the mean score on x across the rows in the first
column is the same for the model fit as for the observed data. (They are also
identical for the second column.)


5.40 Let $Y_i$ be bin($n_i$, $\pi_i$) at $x_i$, and let $p_i = y_i/n_i$. For binomial GLMs with logit link:

a. For $p_i$ near $\pi_i$, show that

$$\log\frac{p_i}{1 - p_i} \approx \log\frac{\pi_i}{1 - \pi_i} + \frac{p_i - \pi_i}{\pi_i(1 - \pi_i)}.$$

b. Show that $z_i^{(t)}$ in (5.24) is a linearized version of the $i$th sample logit, evaluated at approximation $\pi_i^{(t)}$ for $\hat{\pi}_i$.

c. Verify the formula (5.21) for $\mathrm{cov}(\hat{\beta})$.


CHAPTER  6 


Building,  Checking,  and  Applying 
Logistic  Regression  Models 


Having  studied  the  basics  of  fitting  and  interpreting  logistic  regression  models,  we  now 
turn  our  attention  to  building  and  applying  them.  With  several  explanatory  variables,  there 
are many potential models. In Section 6.1 we discuss strategies for model selection. After
choosing  a  preliminary  model,  model  checking  addresses  whether  systematic  lack  of  fit 
exists.  Section  6.2  covers  diagnostics,  such  as  residuals,  for  model  checking.  Section  6.3 
presents  ways  of  summarizing  the  predictive  power  of  a  model. 

In  practice,  an  important  application  is  comparing  two  groups  on  a  binary  response, 
while  adjusting  for  possibly  confounding  variables.  In  Section  6.4  we  present  the 
Cochran-Mantel-Haenszel  test,  a  popular  way  to  do  this  by  forming  strata  for  levels 
of  control  variables.  We  then  present  ways  of  summarizing  the  effect,  with  application  to 
meta-analyses. 

Infinite  estimates  of  logistic  regression  model  parameters  can  occur  with  certain  data 
configurations.  Section  6.5  discusses  ways  to  detect  and  deal  with  them.  Section  6.6  covers 
power  and  sample  size  determination  for  logistic  regression. 


6.1  STRATEGIES  IN  MODEL  SELECTION 

Model  selection  for  logistic  regression  faces  the  same  issues  as  for  ordinary  regression.  The 
selection  process  becomes  harder  as  the  number  of  explanatory  variables  increases,  because 
of  the  rapid  increase  in  possible  effects  and  interactions.  There  are  two  competing  goals: 
The  model  should  be  complex  enough  to  fit  the  data  well.  On  the  other  hand,  ideally  it  should 
be  relatively  simple  to  interpret,  smoothing  rather  than  overfitting  the  data.  Complications 
can  arise  because  of  the  binary  nature  of  the  response  variable,  such  as  infinite  ML  parameter 
estimates  for  some  models  when  one  response  outcome  is  much  more  common  than  the 
other. 

Most  research  studies  are  designed  to  answer  certain  questions.  Those  questions  guide 
the  choice  of  model  terms.  Confirmatory  analyses  then  use  a  restricted  set  of  models.  For 
instance, a study hypothesis about an effect may be tested by comparing models with and
without that effect. For studies that are exploratory rather than confirmatory, a search among
possible  models  may  provide  clues  about  the  dependence  structure  and  raise  questions  for 
future  research.  In  either  case,  it  is  helpful  first  to  study  the  effect  of  each  predictor  by 
itself  using  graphics  (incorporating  smoothing)  for  a  continuous  predictor  or  conditional 
distributions  within  a  contingency  table  for  a  discrete  predictor.  This  gives  you  a  feel  for 
the  marginal  effects. 

6.1.1  How  Many  Explanatory  Variables  Can  Be  in  the  Model? 

Unbalanced  data,  with  relatively  few  responses  of  one  type,  limit  the  number  of  predictors 
for  which  we  can  effectively  estimate  effects.  One  guideline  based  on  a  Monte  Carlo  study 
(Peduzzi  et  al.  1996)  suggested  that  when  there  are  fewer  than  10  outcomes  of  each  type 
per  predictor,  impacts  can  include  severely  biased  parameter  estimates,  poor  standard  error 
estimates,  and  error  rates  for  Wald  tests  and  confidence  intervals  far  from  the  nominal  level. 
If y = 1 only 30 times out of n = 1000, for instance, this guideline implies that ideally the
model  should  contain  no  more  than  three  predictors. 
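
The arithmetic behind this guideline can be sketched in a few lines. This is only an illustration of the rule of thumb attributed above to Peduzzi et al. (1996); the function name `max_predictors` and its `per_predictor` argument are hypothetical, not part of any library.

```python
def max_predictors(n_success: int, n_failure: int, per_predictor: int = 10) -> int:
    """Largest number of predictors that still leaves at least
    `per_predictor` outcomes of the rarer type for each predictor."""
    rarer = min(n_success, n_failure)
    return rarer // per_predictor


# Example from the text: y = 1 only 30 times out of n = 1000.
n, successes = 1000, 30
print(max_predictors(successes, n - successes))  # 3
```

As the next paragraph stresses, this is a rough guide rather than a hard limit.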

This  is  merely  one  guideline  and  does  not  mean  that  you  should  never  consider  models 
that  violate  it.  Many  data  sets  now  have  large  numbers  of  variables  relative  to  the  sample 
size.  With  certain  strategies  presented  in  Chapter  7,  such  as  penalized  likelihood  methods 
that  can  shrink  many  estimates  to  0,  it  is  possible  to  have  very  many  predictors.  Likewise, 
you  should  not  use  such  a  guideline  to  justify  being  overly  ambitious.  For  example,  if  you 
have  1000  outcomes  of  each  type,  you  are  not  usually  well  served  by  a  model  with  100 
predictors. 

Many model selection procedures exist, no one of which is always best. Cautions that apply to ordinary regression hold for any generalized linear model. For instance, a model with several explanatory variables may exhibit multicollinearity, with correlations among them making it seem that no one variable is important when all the others are in the model. A variable
may  seem  to  have  little  effect  because  it  overlaps  considerably  with  the  other  explanatory 
variables  in  the  model,  itself  being  predicted  well  by  the  others.  Deleting  such  a  redundant 
variable  can  be  helpful,  for  instance,  to  reduce  standard  errors  of  other  estimated  effects. 

6.1.2  Example:  Horseshoe  Crab  Mating  Data  Revisited 

The horseshoe crab data set in Table 4.3 has four explanatory variables: color (four categories), spine condition (three categories), weight, and width of the shell. We now fit
a  logistic  regression  model  using  all  these  to  predict  whether  the  female  crab  has  male 
satellites  nearby  (y  =  1). 

We  start  by  fitting  a  model  containing  all  the  main  effects, 

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1\,\text{weight} + \beta_2\,\text{width} + \beta_3 c_1 + \beta_4 c_2 + \beta_5 c_3 + \beta_6 s_1 + \beta_7 s_2,$$

treating color ($c_i$) and spine condition ($s_j$) as qualitative (factors), with indicator variables for the first three colors and the first two spine conditions. Table 6.1 shows results. A likelihood-ratio test that Y is jointly independent of these predictors simultaneously tests $H_0$: $\beta_1 = \cdots = \beta_7 = 0$. The test statistic equals 40.56 with df = 7 (P < 0.0001). This
shows  extremely  strong  evidence  that  at  least  one  predictor  has  an  effect. 
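
As a rough check on the reported P-value, the chi-squared tail probability can be computed with the standard library alone. The helper `chi2_sf` below is a hypothetical name; it uses the closed-form finite sums that hold for integer df, and it is a sketch rather than a replacement for statistical software.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variate, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite sum of x^(j-1) / (1*3*...*(2j-1)).
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


# Likelihood-ratio statistic for the global null hypothesis, from Table 6.1.
p_value = chi2_sf(40.5565, 7)
print(f"P = {p_value:.2e}")
```

The result is far below 0.0001, in line with the `<.0001` entry in the software output.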

Table 6.1  Software Output (Based on SAS) from Fitting Model with All Main Effects to Horseshoe Crab Data

Testing Global Null Hypothesis: BETA = 0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio       40.5565     7        <.0001

Analysis of Maximum Likelihood Estimates

Parameter    Estimate    Std Error    Chi-Square    Pr > ChiSq
Intercept     -9.2734       3.8378        5.8386        0.0157
weight         0.8258       0.7038        1.3765        0.2407
width          0.2631       0.1953        1.8152        0.1779
color 1        1.6087       0.9355        2.9567        0.0855
color 2        1.5058       0.5667        7.0607        0.0079
color 3        1.1198       0.5933        3.5624        0.0591
spine 1       -0.4003       0.5027        0.6340        0.4259
spine 2       -0.4963       0.6292        0.6222        0.4302

Although the overall test is highly significant, the Table 6.1 results are discouraging. The
estimates  for  weight  and  width  are  only  slightly  larger  than  their  SE  values.  The  estimates 
for  the  factors  compare  each  category  to  the  final  one  as  a  baseline.  For  color,  only  one 
effect  is  clearly  significant;  for  spine  condition,  the  largest  difference  is  less  than  a  standard 
error. 

The  small  P-value  for  the  overall  test,  yet  the  lack  of  significance  for  individual  effects,  is 
a  warning  sign  of  multicollinearity.  In  Section  5.2.2  we  showed  strong  evidence  of  a  width 
effect.  Adjusting  for  weight,  color,  and  spine  condition,  little  evidence  remains  of  a  partial 
width  effect.  However,  weight  and  width  have  a  strong  correlation  (0.887).  For  practical 
purposes  they  are  equally  good  predictors,  but  it  is  nearly  redundant  to  use  them  both. 
Our further analysis uses width (W) with color (C) and spine condition (S) as explanatory variables. For simplicity, we symbolize models by their highest-order terms, regarding C and S as factors. For instance, (C + S + W) denotes a model with main effects, whereas (C + S * W) denotes a model that has those main effects plus an S × W interaction. It is not usually sensible to consider a model with interaction that does not also contain the main effects that make up that interaction.


6.1.3  Stepwise  Procedures:  Forward  Selection  and  Backward  Elimination 

In exploratory studies, an algorithmic method for searching among models can be informative if we use results cautiously. Goodman (1971a) proposed methods analogous to forward
selection  and  backward  elimination  in  ordinary  regression. 

Forward  selection  adds  terms  sequentially.  At  each  stage  it  selects  the  term  giving  the 
greatest  improvement  in  fit.  The  minimum  P-value  for  testing  the  term  in  the  model  is  a 
sensible  criterion,  since  reductions  in  deviance  for  different  terms  may  have  different  df 
values.  A  point  of  diminishing  returns  occurs  in  adding  predictors,  when  new  predictors  are 
so  correlated  with  ones  already  used  that  they  do  not  improve  predictive  power.  The  process 
stops  when  further  additions  do  not  significantly  improve  the  fit.  A  stepwise  variation  of 
this  procedure  retests,  at  each  stage,  terms  added  at  previous  stages  to  see  if  they  are  still 
significant.

Backward elimination begins with a complex model and sequentially removes terms. At
each  stage,  it  selects  the  term  whose  removal  has  the  least  damaging  effect  on  the  model 
(e.g.,  largest  P-value).  The  process  stops  when  any  further  deletion  leads  to  a  significantly 
poorer  fit.  With  either  approach,  for  qualitative  predictors  with  more  than  two  categories,  the 
process  should  consider  the  entire  variable  at  any  stage  rather  than  just  individual  indicator 
variables.  Add  or  drop  the  entire  variable  rather  than  just  one  of  its  indicators.  Otherwise, 
the  result  depends  on  the  choice  of  baseline  for  the  indicator  coding.  The  same  remark 
applies  to  interactions  containing  that  variable. 

Some  statisticians  prefer  backward  elimination  over  forward  selection,  feeling  it  safer 
to  delete  terms  from  an  overly  complex  model  than  to  add  terms  to  an  overly  simple 
one.  Forward  selection  can  stop  prematurely  because  a  particular  test  in  the  sequence  has 
low  power.  Neither  strategy  necessarily  yields  a  meaningful  model.  Use  variable  selection 
procedures  with  caution!  Various  studies  have  shown  their  limitations  and  pitfalls  (e.g., 
Steyerberg  et  al.  2001).  When  you  evaluate  many  terms,  one  or  two  that  are  not  truly 
important  may  look  impressive  merely  due  to  chance.  For  instance,  when  all  the  true  effects 
are  weak,  the  largest  sample  effect  is  likely  to  overestimate  substantially  its  true  effect.  It 
is  best  to  use  such  algorithms  in  an  informal  manner.  This  includes  the  interpretation  of 
P-values  used  as  cutoff  points,  since  the  distribution  of  the  minimum  or  maximum  P-value 
evaluated over a set of predictors is not the same as that of a P-value for a preselected
variable. 

Some  software  has  additional  options  for  selecting  a  model.  One  approach  attempts  to 
determine  the  best  model  with  some  fixed  number  of  terms,  according  to  some  criterion.  If 
such  a  method  and  backward  and  forward  selection  procedures  yield  quite  different  models, 
this  is  an  indication  that  such  results  are  of  dubious  use.  Another  such  indication  would  be 
when  a  quite  different  model  results  from  applying  a  given  procedure  to  a  bootstrap  sample 
of  the  same  size  from  the  sample  distribution. 

Finally,  statistical  significance  should  not  be  the  sole  criterion  for  inclusion  of  a  term 
in  a  model,  and  true  significance  can  be  difficult  to  judge  in  any  case  (Westfall  and  Young 
1993).  It  is  sensible  to  include  a  variable  that  is  central  to  the  purposes  of  the  study  and 
report  its  estimated  effect  even  if  it  is  not  statistically  significant.  Keeping  it  in  the  model 
may  help  reduce  bias  in  estimated  effects  of  other  predictors  and  may  make  it  possible 
to  compare  results  with  other  studies  where  the  effect  is  significant,  perhaps  because  of  a 
larger  sample  size.  Algorithmic  selection  procedures  are  no  substitute  for  careful  thought 
in  guiding  the  formulation  of  models. 


6.1.4  Example:  Backward  Elimination  for  Horseshoe  Crab  Data 

Table 6.2 summarizes results of fitting and comparing several logistic models to the horseshoe crab data with predictors width, color, and spine condition. The deviance ($G^2$) test of fit compares the model to the saturated model. As noted in Sections 5.2.4 and 5.2.5, this statistic is not approximately chi-squared when a predictor is continuous, as width is. However, the deviance difference between two models that differ by a modest number of parameters is relevant. That difference is the likelihood-ratio statistic $-2(L_0 - L_1)$ comparing the models, and it has an approximate null chi-squared distribution.

To  select  a  model,  we  use  backward  elimination,  at  each  stage  testing  only  the  highest- 
order  terms  for  each  variable.  It  is  inappropriate,  for  instance,  to  remove  a  main  effect  term 
if the model has interactions involving that term.

Table 6.2  Results of Fitting Several Logistic Regression Models to Horseshoe Crab Data

Model  Predictors (a)            Deviance G²   df    AIC     Models Compared   Deviance Difference   Corr. R(y, μ̂)
1      (C * S * W)               170.44        152   212.4   -                 -
2      (C * S + C * W + S * W)   173.68        155   209.7   (2)-(1)           3.2 (df = 3)
3a     (C * S + S * W)           177.34        158   207.3   (3a)-(2)          3.7 (df = 3)
3b     (C * W + S * W)           181.56        161   205.6   (3b)-(2)          7.9 (df = 6)
3c     (C * S + C * W)           173.69        157   205.7   (3c)-(2)          0.0 (df = 2)
4a     (S + C * W)               181.64        163   201.6   (4a)-(3c)         8.0 (df = 6)
4b     (W + C * S)               177.61        160   203.6   (4b)-(3c)         3.9 (df = 3)
5      (C + S + W)               186.61        166   200.6   (5)-(4b)          9.0 (df = 6)          0.456
6a     (C + S)                   208.83        167   220.8   (6a)-(5)          22.2 (df = 1)         0.314
6b     (S + W)                   194.42        169   202.4   (6b)-(5)          7.8 (df = 3)          0.402
6c     (C + W)                   187.46        168   197.5   (6c)-(5)          0.8 (df = 2)          0.452
7a     (C)                       212.06        169   220.1   (7a)-(6c)         24.5 (df = 1)         0.285
7b     (W)                       194.45        171   198.5   (7b)-(6c)         7.0 (df = 3)          0.402
8      (C = dark + W)            187.96        170   194.0   (8)-(6c)          0.5 (df = 2)          0.447
9      None                      225.76        172   227.8   (9)-(8)           37.8 (df = 2)         0.000

(a) C, color; S, spine condition; W, width.


We begin with the most complex model, symbolized by (C * S * W), model 1 in Table 6.2. This model uses main effects for each term as well as the three two-factor interactions and the three-factor interaction. It allows a separate width effect at each CS combination. (In fact, at some of those combinations y outcomes of only one type occur, which implies that those effects are not estimable.) The likelihood-ratio statistic comparing this model to the simpler model (C * S + C * W + S * W) removing the three-factor interaction term equals 3.2 (df = 3). This suggests that the three-factor term is not needed (P = 0.36), thank goodness, so we continue the simplification process.

At the next stage we compare the model (C * S + C * W + S * W) to the simpler model (C + S + W) containing only main effects. The likelihood-ratio statistic comparing the models is the change in deviance, 186.61 - 173.68 = 12.9 (df = 166 - 155 = 11). This suggests that the two-factor interaction terms are not needed either (P = 0.30). Table 6.2 also shows results for intermediate models, and a backward process dropping a term at a time also results in eliminating all the two-factor terms.

At  the  next  stage  we  consider  dropping  a  main  effect  term.  Table  6.2  shows  little 
consequence  of  removing  S.  Both  remaining  variables  (C  and  W)  then  have  nonnegligible 
effects.  For  instance,  removing  C  increases  the  deviance  (comparing  models  7b  and  6c)  by 
7.0  on  df  =  3  (P  =  0.07).  The  analysis  in  Section  5.4.6  revealed  a  noticeable  difference 
between  dark  crabs  (category  4)  and  the  others.  The  simpler  model  that  has  a  single  indicator 
variable  for  color,  equaling  0  for  dark  crabs  and  1  otherwise,  fits  essentially  as  well.  Further 
simplification  results  in  large  increases  in  deviance  and  is  unjustified. 
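
The three comparisons quoted in this example can be reproduced from the deviances and residual df in Table 6.2. The sketch below is an illustration only: the names `fits`, `lr_test`, and `chi2_sf` are hypothetical, and the tail probability uses closed-form sums valid for integer df rather than a statistical library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail chi-squared probability for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


# (deviance, residual df) for selected models, copied from Table 6.2.
fits = {
    "1": (170.44, 152),   # (C * S * W)
    "2": (173.68, 155),   # (C * S + C * W + S * W)
    "5": (186.61, 166),   # (C + S + W)
    "6c": (187.46, 168),  # (C + W)
    "7b": (194.45, 171),  # (W)
}


def lr_test(simpler: str, fuller: str):
    """Deviance difference, its df, and the P-value for two nested models."""
    g2_s, df_s = fits[simpler]
    g2_f, df_f = fits[fuller]
    stat, df = g2_s - g2_f, df_s - df_f
    return stat, df, chi2_sf(stat, df)


for simpler, fuller in [("2", "1"), ("5", "2"), ("7b", "6c")]:
    stat, df, p = lr_test(simpler, fuller)
    print(f"({simpler})-({fuller}): G2 diff = {stat:.2f}, df = {df}, P = {p:.2f}")
```

The printed P-values agree, to two decimals, with the 0.36, 0.30, and 0.07 reported in the text.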


6.1.5  Model  Selection  and  the  “Correct”  Model 

In selecting a model from a set of candidates, we are mistaken if we think that there is a “correct” one. Any model is a simplification of reality. For instance, width does not have exactly a linear effect on the probability of satellites, whether we use the logit link or the identity link.

What  is  the  logic  of  testing  the  fit  of  a  model  when  we  know  that  it  does  not  truly  hold? 
A  simple  model  that  fits  adequately  has  the  advantages  of  model  parsimony.  If  a  model  has 
relatively  little  bias,  describing  reality  well,  it  tends  to  provide  more  accurate  estimates  of 
the quantities of interest.¹

Other  criteria  besides  significance  tests  can  help  select  a  good  model  in  terms  of  esti¬ 
mating  quantities  of  interest.  We  next  introduce  the  best  known  of  such  criteria. 


6.1.6  AIC:  Minimizing  Distance  of  the  Fit  from  the  Truth 

The  Akaike  information  criterion  (AIC)  judges  a  model  by  how  close  its  fitted  values 
tend  to  be  to  the  true  mean  values,  in  terms  of  a  certain  expected  value.  Even  though  a 
simple  model  is  farther  from  the  true  relationship  than  is  a  more  complex  model,  it  may  be 
preferred  because  it  tends  to  provide  better  estimates  of  certain  characteristics,  such  as  cell 
probabilities.  Thus,  the  optimal  model  is  the  one  that  tends  to  have  fit  closest  to  the  true 
values. 

Akaike defined closeness in terms of a Kullback-Leibler measure of distance. Let $p(y)$ denote the probability (or density) of the data under the true model and $p_M(y)$ the probability under the chosen model. The distance measure is $E\{\log[p(y)/p_M(y)]\}$, where the expected value is taken relative to the true distribution. For categorical data, this measure resembles $G^2$ in form. With a sample, this criterion selects the model that minimizes

AIC  =  —2  (maximized  log  likelihood  —  number  of  parameters  in  model). 

This penalizes a model for having many parameters. With models for categorical Y, this ordering is equivalent to one based on an adjustment of the deviance, $[G^2 - 2(\mathrm{df})]$, by twice its residual df.
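
The remark that the distance measure resembles $G^2$ in form can be made concrete with a short side-by-side. The notation here is an assumption introduced for illustration: a single categorical observation over N cells with true probabilities $\pi_i$ and model probabilities $\pi_{M,i}$, and a sample of size n with cell counts $n_i$ and fitted counts $\hat{\mu}_i$.

```latex
% Kullback-Leibler distance for one categorical observation over N cells
\[
E\left\{\log\frac{p(y)}{p_M(y)}\right\}
  = \sum_{i=1}^{N} \pi_i \log\frac{\pi_i}{\pi_{M,i}} .
\]
% Deviance of a fitted model, rewritten in terms of sample proportions n_i / n
\[
G^2 = 2\sum_{i=1}^{N} n_i \log\frac{n_i}{\hat{\mu}_i}
    = 2n\sum_{i=1}^{N} \frac{n_i}{n}\,\log\frac{n_i/n}{\hat{\mu}_i/n} .
\]
```

Under these assumptions, $G^2/(2n)$ has the same form as the distance measure, with sample proportions in place of the true probabilities and fitted proportions in place of the model probabilities.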

With  many  potential  predictors,  we  can  use  the  AIC  to  aid  in  variable  selection.  Out 
of  a  set  of  candidate  models,  we  identify  the  one  with  smallest  AIC.  However,  models 
with  similar  AIC  values  are  also  of  interest.  For  instance,  we  would  consider  also  more 
parsimonious  models  that  have  AIC  relatively  close  to  the  minimum  value. 

We illustrate AIC for model selection using the models that Table 6.2 lists. That table also shows the AIC values. Of models using the three basic variables, AIC is smallest (AIC = 197.5) for C + W, having main effects of color and width. The simpler model having an indicator variable for whether a crab is dark fares better yet (AIC = 194.0). Either model seems reasonable. We should balance the lower AIC for the simpler model against its having been suggested by the fit of model C + W.
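
The AIC column of Table 6.2 can be recovered from the deviance and residual df. The sketch below rests on two assumptions that are not stated in this section: the data are ungrouped binary responses, so the saturated log likelihood is 0 and the deviance equals -2 times the maximized log likelihood, and the sample size is n = 173, as implied by the residual df of 172 for the null model. The names `table_6_2` and `aic` are hypothetical.

```python
N_CRABS = 173  # assumption: inferred from df = 172 for the null model

# (deviance, residual df) for selected models, copied from Table 6.2.
table_6_2 = {
    "C*S*W": (170.44, 152),
    "C + S + W": (186.61, 166),
    "C + W": (187.46, 168),
    "W": (194.45, 171),
    "C=dark + W": (187.96, 170),
    "None": (225.76, 172),
}


def aic(deviance: float, resid_df: int, n: int = N_CRABS) -> float:
    """AIC = -2(log likelihood - number of parameters) = deviance + 2p here."""
    n_params = n - resid_df
    return deviance + 2 * n_params


aic_values = {name: aic(*fit) for name, fit in table_6_2.items()}
for name, value in sorted(aic_values.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} AIC = {value:.2f}")
best = min(aic_values, key=aic_values.get)
```

Because the deviance plus 2(n - df) differs from $G^2 - 2(\mathrm{df})$ only by the constant 2n, both orderings rank the models identically.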

An alternative Bayesian information criterion (BIC) penalizes more severely for the number of parameters in the model. It replaces 2 by log(n) as the multiple of the number of parameters, so the selected model is no more complex than the one selected with AIC. Compared with AIC, BIC gravitates less quickly toward more complex models as n increases. It is derived based on a Bayesian argument for determining which of a set of models has highest posterior probability. Differences between BIC values for two models relate to a Bayes factor comparing them. It has the property of selecting the “correct model” with probability converging to 1 as $n \to \infty$. However, this is based on the Bayesian structure that provides justification for this approach, and its relevance is unclear when applied with frequentist methods. Also, in practice we do not regard any one model as “correct,” so the AIC approach of choosing the model that is closest to reality seems sensible.

¹ We discussed the parsimony issue, with examples, in Sections 3.3.8, 5.2.2, and 5.3.10.

For the horseshoe crab mating data, from Table 6.2, AIC = 197.5 for model (C + W) and AIC = 198.5 for model (W). By contrast, BIC = 213.2 for model (C + W) and BIC = 204.8 for model (W), thus differing from AIC by preferring the simpler model.
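
The BIC values quoted here follow from the same ingredients as AIC, with log(n) in place of 2. As a hedged sketch, the code below assumes ungrouped binary data (so the deviance equals -2 times the maximized log likelihood) and n = 173, inferred from the residual df of 172 for the null model in Table 6.2; the function name `bic` is hypothetical.

```python
import math

N_CRABS = 173  # assumption: inferred from df = 172 for the null model


def bic(deviance: float, resid_df: int, n: int = N_CRABS) -> float:
    """BIC = deviance + log(n) * (number of parameters), for ungrouped binary data."""
    n_params = n - resid_df
    return deviance + math.log(n) * n_params


bic_cw = bic(187.46, 168)  # model (C + W), 5 parameters
bic_w = bic(194.45, 171)   # model (W), 2 parameters
print(f"BIC(C + W) = {bic_cw:.1f}, BIC(W) = {bic_w:.1f}")
```

With log(173) of about 5.15 per parameter instead of 2, the three extra color parameters no longer pay for themselves, which is why BIC prefers model (W).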


6.1.7  Example:  Using  Causal  Hypotheses  to  Guide  Model  Building 

Although  selection  procedures  are  helpful  exploratory  tools,  the  model-building  process 
should  utilize  theory  and  common  sense.  Often,  a  time  ordering  among  the  variables 
suggests possible causal relationships. Analyzing a certain sequence of models helps to
investigate  those  relationships  (Goodman  1973). 

We  illustrate  with  Table  6.3,  from  a  British  study  that  employed  a  random  sample  survey. 
A  sample  of  men  and  women  who  had  petitioned  for  divorce  and  an  independent  sample 
of  married  people  were  asked:  (a)  “Before  you  married  your  (former)  husband/wife,  had 
you  ever  made  love  with  anyone  else?”;  (b)  “During  your  (former)  marriage,  (did  you 
have)  have  you  had  any  affairs  or  brief  sexual  encounters  with  another  man/woman?”  The 
2 × 2 × 2 × 2 table has variables G = gender, E = extramarital sex report (yes or no), P = premarital sex report, and M = marital status.

The time points at which responses on the four variables occur suggest the following ordering of the variables:

G (gender) → P (premarital sex) → E (extramarital sex) → M (marital status)

Any  of  these  is  an  explanatory  variable  when  a  variable  listed  to  its  right  is  the  response. 
Figure 6.1 shows one possible causal structure. In this figure, a variable at the tip of an arrow
is  a  response  for  a  model  at  some  stage.  The  explanatory  variables  have  arrows  pointing 
toward  the  response,  directly  or  indirectly. 

We first treat P as a response. Figure 6.1 predicts that G has a direct effect on P, so the model of independence of these variables is inadequate. At the second stage, E is the response. Figure 6.1 predicts that P and G have direct effects on E. It also suggests that G has an indirect effect on E, through its effect on P. These effects on E can be analyzed using the logistic model for E with additive G and P effects. If G has only an indirect effect on E, the model with P alone as a predictor is adequate; that is, at a given level of P, E and G are conditionally independent. At the third stage, M is the response. Figure 6.1 predicts that E has a direct effect on M, P has direct effects and indirect effects through its effects on E, and G has indirect effects through its effects on P and E. This suggests the logistic model for M having additive E and P effects. For this model, G and M are independent, given P and E.

Table 6.3  Marital Status by Report of Pre- and Extramarital Sex (PMS and EMS)

Gender:                   Women                      Men
PMS:                 Yes         No            Yes         No
EMS:               Yes   No    Yes   No      Yes   No    Yes   No
Divorced            17   54     36  214       28   60     17   68
Still married        4   25      4  322       11   42      4  130

Source: G. N. Gilbert, Modelling Society. London: George Allen & Unwin, 1981. Reprinted with permission from Unwin Hyman Ltd.

Table 6.4 shows results. The first stage, having P as the response, shows strong evidence of a GP association. The sample odds ratio for their marginal table is 0.27; the estimated odds of premarital sex for females are 0.27 times those for males. The second stage has E as the response. Only weak evidence occurs that G had a direct as well as an indirect effect on E, as $G^2$ drops by 2.9 (df = 1) after adding G to a model already containing P as a predictor. For this model, the estimated EP conditional odds ratio is 3.6.
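
The first-stage figures can be checked directly from the counts in Table 6.3 by collapsing over E and M. The sketch below is an illustration only; the dictionary layout and variable names are hypothetical, and the $G^2$ computed is the usual likelihood-ratio statistic for independence in the 2 × 2 marginal GP table.

```python
import math

# Counts from Table 6.3, keyed by (gender, premarital sex report).
# Each list holds: divorced EMS yes, divorced EMS no, married EMS yes, married EMS no.
counts = {
    ("women", "yes"): [17, 54, 4, 25],
    ("women", "no"): [36, 214, 4, 322],
    ("men", "yes"): [28, 60, 11, 42],
    ("men", "no"): [17, 68, 4, 130],
}

# Marginal G-by-P table: sum over extramarital sex and marital status.
gp = {cell: sum(values) for cell, values in counts.items()}

odds_women = gp[("women", "yes")] / gp[("women", "no")]
odds_men = gp[("men", "yes")] / gp[("men", "no")]
odds_ratio = odds_women / odds_men

# Likelihood-ratio statistic for independence of G and P (df = 1).
n = sum(gp.values())
row = {g: gp[(g, "yes")] + gp[(g, "no")] for g in ("women", "men")}
col = {p: gp[("women", p)] + gp[("men", p)] for p in ("yes", "no")}
g2 = 2 * sum(obs * math.log(obs / (row[g] * col[p] / n)) for (g, p), obs in gp.items())

print(f"odds ratio = {odds_ratio:.2f}, G2 = {g2:.1f}")
```

These values match the 0.27 odds ratio quoted above and the 75.3 entry for stage 1 in Table 6.4.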

The third stage has M as the response. Figure 6.1 specifies the logistic model with main effects of E and P, but it fits poorly. The model that allows an E × P interaction in their effects on M but assumes conditional independence of G and M fits much better ($G^2$ decrease of 13.0, df = 1). The model that also has a main effect for G fits slightly better yet. Either model is more complicated than Figure 6.1 predicted, since the effects of E on M vary according to the level of P. However, some preliminary thought about causal relationships suggested a model similar to one giving a good fit. We leave it to the reader to estimate and interpret effects for the third stage.


Table 6.4  Goodness of Fit of Various Models for Table 6.3 (a)

Stage   Response Variable   Potential Explanatory   Actual Explanatory   G²     df
1       P                   G                       None                 75.3   1
                                                    (G)                   0.0   0
2       E                   G, P                    None                 48.9   3
                                                    (P)                   2.9   2
                                                    (G + P)               0.0   1
3       M                   G, P, E                 (E + P)              18.2   5
                                                    (E * P)               5.2   4
                                                    (E * P + G)           0.7   3

(a) P, premarital sex; E, extramarital sex; M, marital status; G, gender.


6.1.8  Alternative  Strategies,  Including  Model  Averaging

In  practice,  many  models  can  be  consistent  with  the  data.  If,  as  stated  in  Section  6.1.5,  no 
one  of  them  is  “correct,”  it  is  logically  inconsistent  to  choose  one  model  based  on  its  fitting 
the  data  well  and  then  make  subsequent  inferences  acting  as  if  the  model  is  fixed.  This  can 
result  in  a  tendency  to  underestimate  uncertainty  and  to  exaggerate  significance.  Copas  and 
Eguchi  (2010)  discussed  this  issue.  They  noted  that  an  increasingly  popular  way  of  dealing 
with  this  is  Bayesian  model  averaging:  Identify  a  set  of  plausible  models,  specify  prior 
probabilities  for  them,  and  base  inference  on  a  weighting  according  to  posterior  model 
probabilities. Copas and Eguchi proposed an alternative approach that identifies statistically
equivalent  models  (that  are  consistent  with  the  data)  and  constructs  an  “envelope  likelihood” 
that  reflects  the  model  uncertainty.  For  estimation  of  a  particular  measure,  this  approach 
typically  generates  wider  limits  that  more  appropriately  reflect  the  uncertainty. 

As  computing  power  continues  to  explode,  enormous  data  sets  are  more  common,  in 
applications  as  diverse  as  genomic  investigations  and  credit  scoring  by  financial  institutions. 
Many  applications  have  huge  numbers  of  potential  explanatory  variables,  making  model 
selection  much  more  difficult.  We  discuss  special  issues  for  such  cases  in  Section  7.5. 

In  summary,  although  the  focus  of  this  section  has  been  “model  selection,”  it  is  often  not 
sensible to have the goal of picking a single model. Also, we should keep in mind the selection uncertainty when we make inferences based on a model, and also realize the tentative
nature  of  using  the  same  data  in  making  those  inferences  that  were  used  to  select  a  model. 


6.2  LOGISTIC  REGRESSION  DIAGNOSTICS 

In  Section  5.2.3  we  introduced  statistics  for  checking  model  fit  in  a  global  sense.  After 
selecting  a  preliminary  model,  we  obtain  further  insight  by  switching  to  a  microscopic 
mode  of  analysis.  In  contingency  tables,  for  instance,  the  pattern  of  lack  of  fit  revealed  in 
cell-by-cell  comparisons  of  observed  and  fitted  counts  may  suggest  a  better  model  or  may 
indicate  a  segment  of  the  population  for  which  a  generally  good-fitting  model  fails. 

6.2.1  Residuals:  Pearson,  Deviance,  and  Standardized 

With categorical predictors, it is useful to form residuals to compare observed and fitted counts. Let $y_i$ denote the binomial outcome for $n_i$ trials at setting $i$ of the explanatory variables, $i = 1, \ldots, N$. Let $\hat{\pi}_i$ denote the model estimate of $P(Y = 1)$. Then $\hat{\mu}_i = n_i\hat{\pi}_i$ is the fitted number of successes.

For a GLM with binomial random component, for observation $i$ the Pearson residual (4.41) is

$$e_i = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{\widehat{\mathrm{var}}(Y_i)}} = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1 - \hat{\pi}_i)}}. \tag{6.1}$$

This divides the raw residual $(y_i - \hat{\mu}_i)$ by the estimated binomial standard deviation of $y_i$. The Pearson statistic for testing the model fit satisfies

$$X^2 = \sum_{i=1}^{N} e_i^2.$$
1  =  1 

An alternative residual uses components of the $G^2$ fit statistic. This is the deviance residual, introduced for GLMs in (4.42). For a binomial GLM, this is

$$\sqrt{d_i} \times \mathrm{sign}(y_i - n_i\hat{\pi}_i), \tag{6.2}$$

where

$$d_i = 2\left[y_i \log\frac{y_i}{n_i\hat{\pi}_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - n_i\hat{\pi}_i}\right].$$

As explained in Section 4.5.6, these and the $\{e_i\}$ are less variable than $N(0, 1)$.

A standardized version of the Pearson residual divides it by its estimated standard error.
As noted in Section 4.5.6, this is larger in absolute value than the Pearson residual, with an
adjustment that uses the leverage from an estimated hat matrix. For observation $i$ with
leverage $\hat{h}_i$, the standardized residual is

$$r_i = \frac{e_i}{\sqrt{1 - \hat{h}_i}} = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1 - \hat{\pi}_i)(1 - \hat{h}_i)}}.$$

It has the advantages, compared with the Pearson and deviance residuals, of having an
approximate $N(0, 1)$ distribution when the model holds and appropriately recognizing
redundancies (as noted for 2 x 2 tables in Section 3.3.1 and in Section 6.2.3 below).
Absolute values larger than roughly 2 or 3 provide evidence of lack of fit. It takes larger
values to be noteworthy when relatively more of them are inspected.

Plots  of  residuals  against  explanatory  variables  or  linear  predictor  values  may  detect 
a type of lack of fit. When fitted values are very small, however, just as $X^2$ and $G^2$ lose
relevance, so do residuals. When explanatory variables are continuous, often $n_i = 1$ at
each setting. Then $y_i$ can equal only 0 or 1, and $e_i$ can assume only two values. One
must  then  be  cautious  about  regarding  either  outcome  as  extreme,  and  a  single  residual  is 
usually  uninformative  (see  Exercise  6.32).  Plots  of  residuals  also  then  have  limited  use. 
Figure  6.2  illustrates,  plotting  for  the  horseshoe  crab  data  the  standardized  residuals  against 
width  for  the  model  (5.13)  fitted  in  Section  5.4.5  having  width  and  color  as  predictors. 
Width  has  a  strong  positive  effect,  so  necessarily  for  small  width  values  an  observation 
of  y  =  1  will  have  a  relatively  large  positive  residual  whereas  for  large  width  values  an 
observation  of  y  =  0  will  have  a  relatively  large  negative  residual.  When  plotted  against 
fitted  values,  a  plot  of  the  raw  residuals  consists  merely  of  two  parallel  lines  of  points. 
The  deviance  itself  is  then  completely  uninformative  (Exercise  5.35).  When  data  can  be 
grouped  into  sets  of  observations  having  common  predictor  values,  it  is  better  to  compute 
residuals  for  the  grouped  data  than  for  individual  subjects. 


6.2.2  Example:  Heart  Disease  and  Blood  Pressure 

A  sample  of  male  residents  of  Framingham,  Massachusetts,  aged  40  through  59,  were 
classified  on  several  factors,  including  systolic  blood  pressure.  The  response  variable  is 
whether  they  developed  coronary  heart  disease  during  a  six-year  follow-up  period.  Table  6.5 
shows  results. 
Figure 6.2 Plot of standardized residuals against width, for model predicting horseshoe crab satellites using
width and color predictors.


Let $\pi_i$ be the probability of heart disease for blood pressure category $i$. Table 6.5 shows
the fit and the standardized residuals for two logistic regression models. The first model,

$$\mathrm{logit}(\pi_i) = \alpha,$$

treats the response as independent of blood pressure. Some residuals for that model
are large. This is not surprising, since the model fits poorly ($G^2 = 30.02$, $X^2 = 33.38$,
df = 7).

A plot of the residuals for the independence model shows an increasing trend. This
suggests the linear logit model,

$$\mathrm{logit}(\pi_i) = \alpha + \beta x_i,$$


Table 6.5 Presence of Heart Disease by Blood Pressure, with Fit of Logistic Models and
Standardized Residuals

Systolic               Observed   Fitted                      Standardized Residual
Pressure    Sample     Heart      Independence   Linear       Independence   Linear
(mmHg)      Size       Disease    Model          Logit        Model          Logit
<117        156         3         10.8            5.2         -2.62          -1.11
117-126     252        17         17.4           10.6         -0.12           2.37
127-136     284        12         19.7           15.1         -2.02          -0.95
137-146     271        16         18.8           18.1         -0.74          -0.57
147-156     139        12          9.6           11.6          0.84           0.13
157-166      85         8          5.9            8.9          0.93          -0.33
167-186      99        16          6.9           14.2          3.76           0.65
>186         43         8          3.0            8.4          3.07          -0.18

Source: Data from Cornfield (1962).
with scores $\{x_i\}$ for systolic blood pressure level. We used scores (111.5, 121.5, 131.5,
141.5, 151.5, 161.5, 176.5, 191.5). The nonextreme scores are midpoints for the intervals
of blood pressure. The trend in standardized residuals disappears for this model, and only
the second category shows some evidence of lack of fit. A single relatively large residual is
not surprising, however. With many residuals, a few may be large merely by chance. Here
the overall fit statistics ($G^2 = 5.91$, $X^2 = 6.29$, with df = 6) do not indicate problems. In
analyzing residual patterns, we should be cautious about attributing patterns to what might
be chance variation from a model.
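
The linear logit fit and its residuals can be reproduced with a short script. The following is a minimal sketch, not the software used for Table 6.5: it assumes NumPy, fits the model by Newton-Raphson with a hypothetical helper `fit_logit`, and centers the scores only to improve numerical conditioning (fitted values and residuals do not depend on the centering).

```python
import numpy as np

# Grouped data from Table 6.5: sample sizes, heart disease counts, and scores.
n = np.array([156, 252, 284, 271, 139, 85, 99, 43], dtype=float)
y = np.array([3, 17, 12, 16, 12, 8, 16, 8], dtype=float)
x = np.array([111.5, 121.5, 131.5, 141.5, 151.5, 161.5, 176.5, 191.5])

def fit_logit(X, y, n, max_iter=100, tol=1e-10):
    """ML fit of a binomial logit model by Newton-Raphson (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = n * pi * (1.0 - pi)                      # binomial weights
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - n * pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Centering x is an implementation choice; it leaves the fit unchanged.
X = np.column_stack([np.ones_like(x), x - x.mean()])
beta = fit_logit(X, y, n)
pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
mu = n * pi                                          # fitted counts

pearson = (y - mu) / np.sqrt(n * pi * (1 - pi))      # Pearson residuals (6.1)
d = 2 * (y * np.log(y / mu) + (n - y) * np.log((n - y) / (n - mu)))
deviance_res = np.sign(y - mu) * np.sqrt(d)          # deviance residuals (6.2)

# Leverages: diagonal of W^{1/2} X (X'WX)^{-1} X' W^{1/2}.
Xw = np.sqrt(n * pi * (1 - pi))[:, None] * X
h = np.diag(Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T))
standardized = pearson / np.sqrt(1 - h)              # standardized residuals

X2, G2 = np.sum(pearson**2), np.sum(d)
print(np.round(mu, 1))
print(np.round(standardized, 2))
print(round(G2, 2), round(X2, 2))
```

The printed fit statistics and standardized residuals can be compared with the linear logit columns of Table 6.5; small discrepancies in the last digit would reflect rounding.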

A  useful  graphical  display  for  showing  lack  of  fit  compares  sample  and  fitted  proportions 
by  plotting  them  against  each  other  or  by  plotting  both  of  them  against  explanatory  variables. 
For the linear logit model, Figure 6.3 plots both the sample proportions and the estimated
probabilities  of  heart  disease  against  blood  pressure.  The  fit  seems  decent. 

Studying  residuals  helps  us  understand  either  why  a  model  fits  poorly  or  where  there  is 
lack  of  fit  in  a  generally  good-fitting  model.  The  next  example  illustrates  the  second  case. 

6.2.3  Example:  Admissions  to  Graduate  School  at  Florida 

Table  6.6  refers  to  graduate  school  applications  for  the  23  departments  in  the  College  of 
Liberal  Arts  and  Sciences  at  the  University  of  Florida  during  the  1997-1998  academic 
year.  It  cross-classifies  the  applicant’s  gender,  department  to  which  he  or  she  applied,  and 
whether he or she was admitted, which we treat as the response variable. For gender $i$
in department $k$, let $y_{ik}$ denote the number admitted and let $\pi_{ik}$ denote the probability of
admission. We treat $\{Y_{ik}\}$ as independent bin($n_{ik}$, $\pi_{ik}$). Other things being equal, we would
hope the admissions decision is independent of gender. The model with no gender effect,
given the department, is

$$\mathrm{logit}(\pi_{ik}) = \alpha + \beta_k^D.$$

However, this model fits rather poorly ($G^2 = 44.74$, $X^2 = 40.85$, df = 23).

The software output in Table 6.6 reports standardized residuals $\{r_i\}$ for the number of
females who were admitted. For instance, the Astronomy department admitted 6 females,
Table 6.6 Graduate School Admissions by Gender and Department, with Standardized
Residuals for Model of No Gender Effect

         Females        Males          Std. Res
Dept     Yes    No      Yes    No      (Fem, Yes)
anth      32    81       21    41      -0.76
astr       6     0        3     8       2.87
chem      12    43       34   110      -0.27
clas       3     1        4     0      -1.07
comm      52   149        5    10      -0.63
comp       8     7        6    12       1.16
engl      35   100       30   112       0.94
geog       9     1       11    11       2.17
geol       6     3       15     6      -0.26
germ      17     0        4     1       1.89
hist       9     9       21    19      -0.18
lati      26     7       25    16       1.65
ling      21    10        7     8       1.37
math      25    18       31    37       1.29
phil       3     0        9     6       1.34
phys      10    11       25    53       1.32
poli      25    34       39    49      -0.23
psyc       2   123        4    41      -2.27
reli       3     3        0     2       1.26
roma      29    13        6     3       0.14
soci      16    33        7    17       0.30
stat      23     9       36    14      -0.01
zool       4    62       10    54      -1.76

Source: Data courtesy of Prof. James Booth.


which was 2.87 estimated standard deviations higher than the model predicted. Each department
has only a single nonredundant standardized residual, because of marginal constraints
for the model. The model has fit $\hat{\pi}_{ik} = (y_{1k} + y_{2k})/n_{+k}$, corresponding to an independence
fit ($\hat{\pi}_{1k} = \hat{\pi}_{2k}$) in each partial table. Now,

$$y_{1k} - n_{1k}\hat{\pi}_{1k} = y_{1k} - n_{1k}\frac{(y_{1k} + y_{2k})}{n_{+k}} = \frac{n_{2k}}{n_{+k}}y_{1k} - \frac{n_{1k}}{n_{+k}}y_{2k} = -(y_{2k} - n_{2k}\hat{\pi}_{2k}).$$

Thus, standard errors of $(y_{1k} - n_{1k}\hat{\pi}_{1k})$ and $(y_{2k} - n_{2k}\hat{\pi}_{2k})$ are identical. The standardized
residuals are identical in absolute value for males and females but of different sign. Astronomy
admitted 3 males, and their standardized residual was -2.87; the number admitted
was 2.87 estimated standard deviations lower than predicted.
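
The sign symmetry can be checked numerically for the Astronomy department. This sketch uses only the Python standard library; the function name `binomial_residuals` is hypothetical. It assumes the leverage for gender $i$ in department $k$ is $n_{ik}/n_{+k}$, which is what the hat matrix gives when each department has its own parameter and both genders share one fitted probability.

```python
import math

def binomial_residuals(y1, n1, y2, n2):
    """Pearson and standardized residuals for the two binomial rows of one
    partial table, under the model with no gender effect (hypothetical helper)."""
    pi_hat = (y1 + y2) / (n1 + n2)            # common fitted probability
    out = []
    for y, n in ((y1, n1), (y2, n2)):
        pearson = (y - n * pi_hat) / math.sqrt(n * pi_hat * (1 - pi_hat))
        h = n / (n1 + n2)                     # leverage under this model (assumption stated above)
        out.append((pearson, pearson / math.sqrt(1 - h)))
    return out

# Astronomy in Table 6.6: 6 of 6 females admitted, 3 of 11 males admitted.
(fem_e, fem_r), (male_e, male_r) = binomial_residuals(6, 6, 3, 11)
print(round(fem_e, 2), round(male_e, 2))      # Pearson residuals differ in size
print(round(fem_r, 2), round(male_r, 2))      # standardized: equal size, opposite sign
```

The Pearson residuals for the two genders have different magnitudes, whereas the standardized residuals agree in absolute value, which is the redundancy the text describes.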

Having a single nonredundant value $r_i$ for each df is an advantage of standardized
residuals over Pearson (or deviance) residuals. The model of conditional independence has
df = 1 for each partial table. Only one bit of information exists about how the data depart
from the model, yet the Pearson residual for males need not equal the Pearson residual
for females in absolute value. The $\{r_i\}$ for females who were admitted in each department
satisfy $\sum_{i=1}^{23} r_i^2 = X^2$, their squares giving 23 df = 1 components for the Pearson statistic.
The 46 squared Pearson residuals would have the same sum, but each has null distribution
smaller than $\chi^2_1$.

Departments  with  large  standardized  residuals  reveal  the  reason  for  the  lack  of  fit. 
Significantly  more  females  were  admitted  than  the  model  predicts  in  the  Astronomy  and 
Geography  departments,  and  fewer  in  the  Psychology  department.  Without  these  three 
departments, the model fits reasonably well ($G^2 = 24.37$, $X^2 = 22.75$, df = 20).

For  the  complete  data,  adding  a  gender  effect  to  the  model  does  not  provide  an  improved 
fit ($G^2 = 42.36$, $X^2 = 38.99$, df = 22), because the departments just mentioned have
associations  in  different  directions  and  of  greater  magnitude  than  other  departments.  This 
model  has  an  ML  estimate  of  1.19  for  the  gender  conditional  odds  ratio,  the  odds  of 
admission  being  19%  higher  for  females  than  males,  given  department.  By  contrast,  the 
marginal  table  collapsed  over  department  has  a  sample  odds  ratio  of  0.94,  the  overall  odds  of 
admission  being  6%  lower  for  females.  This  illustrates  Simpson’s  paradox  (Section  2.3.2), 
the estimated conditional association having different direction than the estimated marginal
association.
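
The marginal odds ratio quoted above can be recovered by collapsing Table 6.6 over department. The sketch below uses plain Python; the dictionary name `admissions` is hypothetical and the counts are transcribed from Table 6.6. The conditional estimate of 1.19 requires fitting the logistic model and is not recomputed here.

```python
# (females yes, females no, males yes, males no) for each department in Table 6.6
admissions = {
    "anth": (32, 81, 21, 41),   "astr": (6, 0, 3, 8),      "chem": (12, 43, 34, 110),
    "clas": (3, 1, 4, 0),       "comm": (52, 149, 5, 10),  "comp": (8, 7, 6, 12),
    "engl": (35, 100, 30, 112), "geog": (9, 1, 11, 11),    "geol": (6, 3, 15, 6),
    "germ": (17, 0, 4, 1),      "hist": (9, 9, 21, 19),    "lati": (26, 7, 25, 16),
    "ling": (21, 10, 7, 8),     "math": (25, 18, 31, 37),  "phil": (3, 0, 9, 6),
    "phys": (10, 11, 25, 53),   "poli": (25, 34, 39, 49),  "psyc": (2, 123, 4, 41),
    "reli": (3, 3, 0, 2),       "roma": (29, 13, 6, 3),    "soci": (16, 33, 7, 17),
    "stat": (23, 9, 36, 14),    "zool": (4, 62, 10, 54),
}

# Collapse over department to get the marginal 2 x 2 table.
fy, fn, my, mn = (sum(row[j] for row in admissions.values()) for j in range(4))

# Marginal odds of admission for females relative to males.
marginal_or = (fy * mn) / (fn * my)
print(fy, fn, my, mn, round(marginal_or, 2))
```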

6.2.4  Influence  Diagnostics  for  Logistic  Regression 

Other  regression  diagnostic  tools  are  also  helpful  in  assessing  fit.  These  include  plots  of 
ordered  standardized  residuals  against  normal  percentiles  (Haberman  1973a)  and  analyses 
that  describe  an  observation’s  influence  on  parameter  estimates  and  fit  statistics.  Whenever 
a  residual  indicates  that  a  model  fits  an  observation  poorly,  it  can  be  informative  to  delete  the 
observation  and  refit  the  model  to  remaining  ones.  This  is  equivalent  to  adding  a  parameter 
to  the  model  for  that  observation,  forcing  a  perfect  fit  for  it. 

For  ungrouped  binary  data,  the  notion  of  an  outlier  is  not  as  clear  as  in  ordinary  regression. 
Copas  (1988)  used  a  probabilistic  definition  whereby,  if  the  fitted  model  were  true,  the 
observation would be very unlikely to occur. But then, if $\hat{\pi}_i$ is close to 1 or close to 0 over
certain  regions  of  explanatory  variable  values,  it  is  not  at  all  surprising  to  observe  some 
outliers.  Copas  studied  how  various  models  differ  in  their  sensitivity  to  outliers. 

As in ordinary regression, a single observation can be quite influential in determining
parameter estimates. The greater an observation’s leverage, the greater its potential influence.
The fit could be quite different if an observation that appears to be an outlier on $y$
and has large leverage is deleted. However, a single observation can have a much more
exorbitant influence in ordinary least-squares regression than in logistic regression, since
ordinary regression has no bound on the distance of $y_i$ from its expected value. In Section
4.5.6 we observed that the GLM estimated hat matrix

$$\widehat{\mathrm{Hat}} = \hat{W}^{1/2}X(X'\hat{W}X)^{-1}X'\hat{W}^{1/2}$$

depends on the fit as well as the model matrix $X$. For logistic regression, recall (from
Section 5.5.2) that the weight matrix $\hat{W}$ is diagonal with element $\hat{w}_i = n_i\hat{\pi}_i(1 - \hat{\pi}_i)$ for
the $n_i$ observations at setting $i$ of predictors. Points that have extreme predictor values
need not have high leverage. In fact, the leverage can be relatively small if $\hat{\pi}_i$ is close to
0 or 1.

Several  measures  describe  the  effect  of  removing  an  observation  from  the  data  set.  They 
are  related  algebraically  to  the  observation’s  leverage  (Pregibon  1981,  Williams  1987).  In 
logistic  regression,  the  observation  could  be  a  single  binary  response  or  a  binomial  response 
for  a  set  of  subjects  all  having  the  same  predictor  values  (i.e.,  ungrouped  or  grouped  data). 
For  each  observation,  influence  measures  of  deleting  the  observation  include: 

1.  For  each  model  parameter,  the  change  in  its  estimate.  This  change,  divided  by  its 
standard  error,  is  called  Dfbeta. 

2.  A  measure  of  the  change  in  a  joint  confidence  interval  for  the  parameters.  This 
confidence  interval  displacement  diagnostic  is  denoted  by  c. 

3. The change in $X^2$ or $G^2$ goodness-of-fit statistics. Pregibon (1982) showed that the
change in $X^2$ approximates the squared standardized residual for that observation.

For  each  measure,  the  larger  the  value,  the  greater  the  influence.  With  continuous  or 
multiple  predictors,  it  can  be  informative  to  plot  these  diagnostics,  for  instance,  against  the 
estimated  probabilities. 




Table 6.7 Diagnostic Measures for Logistic Regression Models Fitted to Heart Disease Data

Blood                        Pearson       Likelihood-Ratio   Pearson        Likelihood-Ratio
Pressure   Dfbeta    c       X^2 Diff.     G^2 Diff.          X^2 Diff.(a)   G^2 Diff.(a)
111.5       0.49    0.34     1.22          1.39                6.86           9.13
121.5      -1.14    2.26     5.64          5.04                0.02           0.02
131.5       0.33    0.31     0.89          0.94                4.08           4.56
141.5       0.08    0.09     0.33          0.34                0.55           0.57
151.5       0.01    0.00     0.02          0.02                0.70           0.66
161.5      -0.07    0.02     0.11          0.11                0.87           0.80
176.5       0.40    0.26     0.42          0.42               14.17          10.83
191.5      -0.12    0.02     0.03          0.03                9.41           6.73

(a) Independence model; other values refer to linear logit model with blood pressure predictor.


We illustrate the diagnostics using the linear logit model for Table 6.5, which has blood
pressure as a predictor for heart disease. Table 6.7 contains simple approximations (due to
Pregibon 1981) for the Dfbeta measure for the coefficient of blood pressure, the confidence
interval diagnostic $c$, the change in $G^2$, and the change in $X^2$ (which is the square of the
standardized residual, $r_i^2$). All their values show that deleting the second observation has the
greatest effect. This is not surprising, as that observation has the only relatively large residual.
By contrast, Table 6.7 also contains the changes in $X^2$ and $G^2$ for deleting observations
in fitting the independence model. At the low and high ends of the blood pressure values,
several changes are very large. However, these all relate to removing an entire binomial
sample at a blood pressure level instead of removing a single subject’s binary observation.
Such subject-level (ungrouped data) deletions have little effect even for this model.
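
Deletion diagnostics can also be obtained by brute force, refitting the model with each grouped observation removed. The sketch below does this for the slope of the linear logit model. It is an illustration under stated assumptions rather than the one-step approximations reported in Table 6.7: the helper `fit_logit` is hypothetical, the change is taken as the full-data estimate minus the deleted-data estimate, and it is scaled by the full-data standard error. Exact refits will generally differ somewhat from the tabled approximations.

```python
import numpy as np

# Grouped heart disease data from Table 6.5.
n = np.array([156, 252, 284, 271, 139, 85, 99, 43], dtype=float)
y = np.array([3, 17, 12, 16, 12, 8, 16, 8], dtype=float)
x = np.array([111.5, 121.5, 131.5, 141.5, 151.5, 161.5, 176.5, 191.5])

def fit_logit(X, y, n, max_iter=100, tol=1e-10):
    """Newton-Raphson fit; returns estimates and their estimated covariance matrix."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ ((n * pi * (1 - pi))[:, None] * X)
        step = np.linalg.solve(info, X.T @ (y - n * pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ ((n * pi * (1 - pi))[:, None] * X)
    return beta, np.linalg.inv(info)

# Center at a fixed constant for numerical stability; the slope is unaffected.
X = np.column_stack([np.ones_like(x), x - 150.0])
beta_full, cov_full = fit_logit(X, y, n)
se_slope = np.sqrt(cov_full[1, 1])

dfbeta = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i              # drop grouped observation i
    beta_i, _ = fit_logit(X[keep], y[keep], n[keep])
    dfbeta.append((beta_full[1] - beta_i[1]) / se_slope)
dfbeta = np.array(dfbeta)

print(np.round(dfbeta, 2))
print("most influential row:", int(np.argmax(np.abs(dfbeta))) + 1)
```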


6.3  SUMMARIZING  THE  PREDICTIVE  POWER  OF  A  MODEL 

In ordinary regression, $R^2$ describes the reduction in the conditional variation of the response
compared with the marginal variation. It and the multiple correlation $R$ describe how well
the explanatory variables can predict the response, with $R = 1$ for perfect prediction.
Despite various attempts to define analogs for categorical response models, no proposed
measure is as widely useful as $R$ and $R^2$. In this section we present a few ways proposed
for summarizing predictive power.

6.3.1  Summarizing  Predictive  Power:  R  and  R-Squared  Measures 

For any GLM, the correlation $R(y, \hat{\mu})$ between the observed responses $\{y_i\}$ and the model’s
fitted values $\{\hat{\mu}_i\}$ measures predictive power. For least-squares regression, $R$ is the multiple
correlation  between  Y  and  the  predictors.  An  advantage  of  the  correlation,  relative  to  its 
square,  is  the  appeal  of  working  on  the  original  scale  and  its  approximate  proportionality 
to  effect  size:  For  a  small  effect  with  a  single  predictor,  doubling  the  slope  corresponds 
approximately  to  doubling  R. 

In logistic regression with ungrouped data, $\hat{\mu}_i$ for a particular model is the estimated
probability $\hat{\pi}_i$ for binary observation $i$. So, $R(y, \hat{\mu})$ is then the correlation between the
$n$ binary $\{y_i\}$ observations (1 or 0 for each) and the estimated probabilities. The highly
discrete nature of $y$ can suppress the range of possible $R$ values. Nevertheless, $R$ is useful for
comparing fits of different models for the same data. A caveat is that with many predictors
the $R$ estimates can become highly biased upwards in estimating the true correlation,
$R(Y, E(Y \mid X))$, so it can be misleading to compare sample $R$ values for models with greatly
different df values. A jackknife adjustment can reduce this bias (Zheng and Agresti 2000).

Another way to measure the association between the binary responses $\{y_i\}$ and their
fitted values $\{\hat{\pi}_i\}$ uses the proportional reduction in squared error

$$1 - \frac{\sum_i (y_i - \hat{\pi}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

obtained by using $\hat{\pi}_i$ instead of $\bar{y} = \sum_i y_i/n$ as a predictor of $y_i$ (Efron 1978). Amemiya
(1981) suggested a related measure that weights squared deviations by inverse predicted
variances. For logistic regression, unlike normal GLMs, these and $R(y, \hat{\mu})$ need not be
nondecreasing as the model gets more complex. Like any correlation-type measure, they
can depend strongly on the range of observed values of explanatory variables, and as
computed for sample data are biased upward as estimates of corresponding population
measures. Bias corrections are possible (e.g., Liao and McGee 2003).
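
Both measures are simple to compute once fitted probabilities are available. The sketch below assumes NumPy; the function names and the four-observation data set are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

def correlation_R(y, fitted):
    """Correlation between observed responses and fitted values."""
    return float(np.corrcoef(y, fitted)[0, 1])

def squared_error_reduction(y, pi_hat):
    """Proportional reduction in squared error from using pi_hat instead of the mean of y."""
    y = np.asarray(y, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    return float(1 - np.sum((y - pi_hat) ** 2) / np.sum((y - y.mean()) ** 2))

# Hypothetical binary outcomes and fitted probabilities (illustration only).
y = np.array([0, 0, 1, 1], dtype=float)
pi_hat = np.array([0.2, 0.4, 0.6, 0.8])

R = correlation_R(y, pi_hat)
reduction = squared_error_reduction(y, pi_hat)
print(round(R, 3), round(reduction, 3))
```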

6.3.2  Summarizing  Predictive  Power:  Likelihood  and  Deviance  Measures 

Other measures of predictive power directly use the likelihood function. Denote the maximized
log likelihood by $L_M$ for a given model, $L_S$ for the saturated model, and $L_0$ for the
null model containing only an intercept term. Probabilities are no greater than 1.0, so log
likelihoods are nonpositive. As the model complexity increases, the parameter space expands,
so the maximized log likelihood increases. Thus, $L_0 \le L_M \le L_S \le 0$. The measure

$$\frac{L_M - L_0}{L_S - L_0} \qquad (6.3)$$

falls between 0 and 1. It equals 0 when the model provides no improvement in fit over the
null model, and it equals 1 when the model fits as well as the saturated model. A weakness
is that the log likelihood is not an easily interpretable scale. Interpreting the numerical value
is difficult, other than in a comparative sense for different models.

For $N$ independent Bernoulli observations, the maximized log likelihood is

$$\log \prod_{i=1}^{N}\left[\hat{\pi}_i^{y_i}(1 - \hat{\pi}_i)^{1 - y_i}\right] = \sum_{i=1}^{N}\left[y_i \log\hat{\pi}_i + (1 - y_i)\log(1 - \hat{\pi}_i)\right].$$

The null model gives $\hat{\pi}_i = (\sum_i y_i)/N = \bar{y}$, so that

$$L_0 = N[\bar{y}(\log \bar{y}) + (1 - \bar{y})\log(1 - \bar{y})].$$

The saturated model has a parameter for each subject and implies that $\hat{\pi}_i = y_i$ for all $i$.
Thus, $L_S = 0$ and (6.3) simplifies to

$$D = \frac{L_0 - L_M}{L_0}.$$

McFadden (1974) proposed this measure.
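
For ungrouped data the measure $D$ needs only the binary outcomes and the fitted probabilities. The following sketch assumes NumPy; the function names and the small data set are hypothetical, and the fitted probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def bernoulli_loglik(y, p):
    """Log likelihood of independent Bernoulli outcomes y with probabilities p."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def likelihood_D(y, pi_hat):
    """D = (L_0 - L_M) / L_0 for ungrouped binary data, where L_S = 0."""
    y = np.asarray(y, dtype=float)
    L_M = bernoulli_loglik(y, np.asarray(pi_hat, dtype=float))
    ybar = y.mean()
    L_0 = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return (L_0 - L_M) / L_0

# Hypothetical outcomes and fitted probabilities (illustration only).
y = np.array([0, 0, 1, 1], dtype=float)
pi_hat = np.array([0.2, 0.4, 0.6, 0.8])
D = likelihood_D(y, pi_hat)
print(round(D, 3))
```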
Suppose there are multiple observations at each setting of explanatory variables. Then,
the data file can take the grouped-data form of $N$ binomial counts with binomial indices
$\{n_i\}$, rather than the ungrouped form of $N$ Bernoulli indicators each with $n_i = 1$. The
saturated model then has a parameter for each count. It gives $N$ fitted proportions equal to
the $N$ sample proportions of success. Then $L_S$ is nonzero and (6.3) takes a different value
than when calculated using individual subjects. For $N$ binomial counts, the maximized
likelihoods are related to the $G^2$ goodness-of-fit statistic by $G^2(M) = -2(L_M - L_S)$, so
(6.3) becomes

$$D^* = \frac{G^2(0) - G^2(M)}{G^2(0)}.$$

Goodman (1971a) and Theil (1970) discussed this and related partial association
measures.

With grouped data $D^*$ can be large even when predictive power is weak at the subject
level. For instance, a model can fit much better than the null model even though fitted
probabilities are close to 0.50 for the entire sample. In particular, $D^* = 1$ when it fits
perfectly, regardless of how well one can predict individual subjects’ responses on $Y$ with
that model. Also, suppose that the population satisfies the given model, but not the null
model. As the sample size $n = \sum_i n_i$ increases with number of settings $N$ fixed, $G^2(M)$
behaves like a chi-squared random variable but $G^2(0)$ eventually grows unboundedly. Thus,
$D^* \to 1$ (in probability) as $n \to \infty$, and its magnitude tends to depend on $n$. This measure
confounds model goodness of fit with predictive power. Similar behavior occurs for $R^2$
in regression analyses when calculated using means of $y$ values (rather than individual $y$
values) at $N$ different $x$ settings. It is more sensible to use $D$ for binary, ungrouped data.
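
As a worked illustration with numbers already reported in Section 6.2.2, take the grouped heart disease data, with the independence model as the null model ($G^2(0) = 30.02$) and the linear logit model as $M$ ($G^2(M) = 5.91$). Then

$$D^* = \frac{G^2(0) - G^2(M)}{G^2(0)} = \frac{30.02 - 5.91}{30.02} = \frac{24.11}{30.02} \approx 0.80.$$

This value looks large, yet every fitted probability of heart disease in that example is far from 1, so predictions for individual subjects remain weak. The calculation is offered only as an illustration of the point above about grouped data.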

6.3.3  Summarizing  Predictive  Power:  Classification  Tables 

A classification table cross-classifies the binary response with a prediction of whether $y = 0$
or 1. The prediction for observation $i$ is $\hat{y} = 1$ when $\hat{\pi}_i > \pi_0$ and $\hat{y} = 0$ when $\hat{\pi}_i \le \pi_0$,
for some cutoff $\pi_0$. One possibility is $\pi_0 = 0.50$. Another is the sample proportion of 1
outcomes, which is $\hat{\pi}_i$ for the model containing only an intercept term. Rather than using
$\hat{\pi}_i$ from the model fitted to the data set of which $y_i$ was one element, it is better to make
the prediction with the “leave-one-out” cross-validation approach by which $\hat{\pi}_i$ is based on
the model fitted to the other $n - 1$ observations.

Using a classification table, we can summarize the predictive power by

$$\text{sensitivity} = P(\hat{y} = 1 \mid y = 1) \quad \text{and} \quad \text{specificity} = P(\hat{y} = 0 \mid y = 0).$$

(Recall Section 2.1.3.) An overall summary of predictive power is the proportion of correct
classifications. This estimates

$$P(\text{correct classification}) = P(y = 1 \text{ and } \hat{y} = 1) + P(y = 0 \text{ and } \hat{y} = 0)$$

$$= P(\hat{y} = 1 \mid y = 1)P(y = 1) + P(\hat{y} = 0 \mid y = 0)P(y = 0),$$

which is a weighted average of sensitivity and specificity.

A classification table has limitations: It collapses continuous predictive values $\hat{\pi}$ into
binary ones. The choice of $\pi_0$ is arbitrary. Results are sensitive to the relative numbers of
times that $y = 1$ and $y = 0$. For example, if a low proportion of observations have $y = 1$,
the model fit may never have $\hat{\pi}_i > 0.50$, in which case one never predicts $\hat{y} = 1$. Again,
the main use is for comparing different models with the same data.
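
A classification summary for a given cutoff can be sketched as follows. The function name `classification_summary` and the eight observations are hypothetical; in practice the probabilities passed in would preferably be the leave-one-out values described above, which this sketch does not compute.

```python
import numpy as np

def classification_summary(y, pi_hat, cutoff):
    """Sensitivity, specificity, and proportion correct for predictions
    y_hat = 1 when pi_hat > cutoff and y_hat = 0 otherwise."""
    y = np.asarray(y)
    pred = (np.asarray(pi_hat, dtype=float) > cutoff).astype(int)
    tp = int(np.sum((y == 1) & (pred == 1)))   # correctly predicted successes
    fn = int(np.sum((y == 1) & (pred == 0)))
    tn = int(np.sum((y == 0) & (pred == 0)))   # correctly predicted failures
    fp = int(np.sum((y == 0) & (pred == 1)))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "correct": (tp + tn) / len(y),
    }

# Hypothetical outcomes and estimated probabilities (illustration only).
y = [1, 1, 1, 0, 0, 1, 0, 0]
pi_hat = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.55]
summary = classification_summary(y, pi_hat, cutoff=0.5)
print(summary)

# The proportion correct is the average of sensitivity and specificity,
# weighted by the sample proportions of y = 1 and y = 0.
p1 = float(np.mean(y))
weighted = summary["sensitivity"] * p1 + summary["specificity"] * (1 - p1)
```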


6.3.4  Summarizing  Predictive  Power:  ROC  Curves 

The classification table summaries depend on the cutoff $\pi_0$ for making classifications.
A receiver operating characteristic (ROC) curve is a plot of sensitivity as a function of
(1 − specificity) for the possible $\pi_0$. A ROC curve is more informative than a classification
table, because it summarizes predictive power for all possible $\pi_0$. When $\pi_0$ is near 0, almost
all predictions are $\hat{y} = 1$; then, sensitivity is near 1, specificity is near 0, and the point (1 −
specificity, sensitivity) ≈ (1, 1). When $\pi_0$ is near 1, almost all predictions are $\hat{y} = 0$; then,
sensitivity is near 0, specificity is near 1, and (1 − specificity, sensitivity) ≈ (0, 0). A ROC
curve usually has a concave shape connecting the points (0, 0) and (1, 1).

For a given specificity, better predictive power corresponds to higher sensitivity. So, the
better the predictive power, the higher the ROC curve. In a summary sense, the greater
the area under the ROC curve, the better the predictions. In fact, the area under a ROC
curve is identical to the value of another measure of predictive power, the concordance
index (Hanley and McNeil 1982). Consider all pairs of observations $(i, j)$ for which $y_i = 1$
and $y_j = 0$. The concordance index $c$ is the proportion of such pairs for which $\hat{\pi}_i > \hat{\pi}_j$;
that is, it is the relative frequency of the pairwise predictions and the outcomes being
concordant, the observation with the larger $y$ also having the larger $\hat{\pi}$. A value $c = 0.50$
means predictions are no better than random guessing. This corresponds to a model having
only an intercept term and an ROC curve that is a straight line connecting points (0, 0)
and (1, 1).
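
The equality of the area under the ROC curve and the concordance index can be verified on a small example. The sketch below assumes NumPy; the function names and data are hypothetical. Tied pairs are counted as one half, a common convention that is an assumption here (the example has no ties, so the convention does not affect the result).

```python
import numpy as np

def roc_points(y, pi_hat):
    """(1 - specificity, sensitivity) over all cutoffs, with y_hat = 1 when pi_hat > cutoff."""
    y = np.asarray(y)
    pi_hat = np.asarray(pi_hat, dtype=float)
    cutoffs = np.concatenate(([np.inf], np.sort(np.unique(pi_hat))[::-1], [-np.inf]))
    fpr, tpr = [], []
    for c in cutoffs:
        pred = pi_hat > c
        tpr.append(np.mean(pred[y == 1]))      # sensitivity
        fpr.append(np.mean(pred[y == 0]))      # 1 - specificity
    return np.array(fpr), np.array(tpr)

def concordance_index(y, pi_hat):
    """Proportion of (y=1, y=0) pairs in which the y=1 observation has the larger pi_hat."""
    y = np.asarray(y)
    pi_hat = np.asarray(pi_hat, dtype=float)
    diff = pi_hat[y == 1][:, None] - pi_hat[y == 0][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Hypothetical outcomes and estimated probabilities (illustration only).
y = [0, 0, 1, 0, 1, 1]
pi_hat = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]

fpr, tpr = roc_points(y, pi_hat)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoidal area
c_index = concordance_index(y, pi_hat)
print(round(auc, 4), round(c_index, 4))
```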


6.3.5  Example:  Evaluating  Predictive  Power  for  Horseshoe  Crab  Data 

Table 6.2 shows the correlation $R(y, \hat{\mu})$ for some models fitted to the horseshoe crab data for
predicting whether a female crab had at least one satellite. Color alone ($C$) has $R = 0.285$,
width alone ($W$) has $R = 0.402$, and using both ($C + W$) increases $R$ to 0.452. The simpler
model ($C$ = dark + $W$) that uses color as binary merely to indicate whether a crab is dark
does nearly as well, with $R = 0.447$. These models fit essentially as well as more complex
models not shown in the table. For example, the model that adds an interaction term to the
model ($C$ = dark + $W$) has $R = 0.452$.

Other measures of predictive power have different magnitudes but similar results in
comparing various models. For example, the concordance index $c = 0.639$ with model ($C$)
(in factor form), 0.742 with model ($W$), 0.771 with model ($C + W$), 0.772 with model
($C$ = dark + $W$), and 0.772 for the model that adds an interaction term to this model.

Next, we illustrate a classification table, for the model ($C + W$). Of the 173 crabs,
111 had a satellite, for a sample proportion of 0.642. Table 6.8 shows classification tables
using $\pi_0 = 0.50$ and $\pi_0 = 0.642$ with cross-validated predictions. When $\pi_0 = 0.642$, from
Table 6.8 the estimated sensitivity = 74/111 = 0.667 and specificity = 42/62 = 0.677. The
proportion of correct classifications is (74 + 42)/173 = 0.671.
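
For comparison, the same arithmetic can be applied to the $\pi_0 = 0.50$ counts in Table 6.8 (94 and 17 in the $y = 1$ row, 34 and 28 in the $y = 0$ row). This is a worked illustration from the tabled counts:

$$\text{sensitivity} = \frac{94}{111} \approx 0.847, \qquad \text{specificity} = \frac{28}{62} \approx 0.452, \qquad \text{proportion correct} = \frac{94 + 28}{173} = \frac{122}{173} \approx 0.705.$$

Lowering the cutoff from 0.642 to 0.50 raises sensitivity and lowers specificity, which shows how strongly these summaries depend on the choice of $\pi_0$.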

Figure 6.4 shows how PROC LOGISTIC in SAS reports the ROC curve for the model
($C + W$). When $\pi_0 = 0.642$, specificity = 0.68, sensitivity = 0.67, and the point plotted
has coordinates (0.32, 0.67). The area under the curve is $c = 0.771$.
Table 6.8 Classification Tables for Horseshoe Crab Mating Data

            Prediction, pi_0 = 0.642     Prediction, pi_0 = 0.500
Actual      y-hat = 1    y-hat = 0       y-hat = 1    y-hat = 0       Total
y = 1          74           37              94           17            111
y = 0          20           42              34           28             62


[Plot: ROC Curve for Model; Area Under the Curve = 0.7714; horizontal axis: 1 − Specificity]

Figure 6.4 ROC curve (from SAS PROC LOGISTIC) for logistic regression model estimating the probability
a crab has satellites, using width and color predictors.


6.4  MANTEL-HAENSZEL  AND  RELATED  METHODS  FOR 
MULTIPLE  2x2  TABLES 

The analysis of the graduate admissions data in Section 6.2.3 used the model of conditional
independence. This model is an important one in biomedical studies that investigate whether
an association exists between a treatment variable and a disease outcome after adjusting for
a possibly confounding variable that might influence that association. We next present the
test of conditional independence as a logistic model analysis for a 2 x 2 x K contingency
table. We also present a test and a related estimation method, due to Mantel and Haenszel
(1959), that seem non-model-based but relate to the same logistic model.

We  illustrate  using  Table  6.9,  showing  results  of  a  clinical  trial  with  eight  centers.  The 
study  compared  two  cream  preparations,  an  active  drug  and  a  control,  on  their  success  in 
curing  an  infection.  This  table  illustrates  a  common  pharmaceutical  application,  comparing 
two  treatments  on  a  binary  response  with  observations  from  several  strata.  The  strata  are 
Table 6.9 Clinical Trial Relating Treatment to Response for Eight Centers, with Expected
Value and Variance (of Success Count for Drug) Under Conditional Independence

                        Response
Center   Treatment   Success   Failure   Odds Ratio   mu_11k   var(n_11k)
1        Drug           11        25        1.19       10.36      3.79
         Control        10        27
2        Drug           16         4        1.82       14.62      2.47
         Control        22        10
3        Drug           14         5        4.80       10.50      2.41
         Control         7        12
4        Drug            2        14        2.29        1.45      0.70
         Control         1        16
5        Drug            6        11        ∞           3.52      1.20
         Control         0        12
6        Drug            1        10        ∞           0.52      0.25
         Control         0        10
7        Drug            1         4        2.0         0.71      0.42
         Control         1         8
8        Drug            4         2        0.33        4.62      0.62
         Control         6         1

Source: Beitler and Landis (1985).


often  medical  centers  or  clinics;  or,  they  may  be  levels  of  age  or  severity  of  the  condition 
being  treated;  or,  they  may  be  combinations  of  levels  of  several  control  variables;  or,  they 
may  be  different  studies  of  the  same  sort  summarized  in  a  meta-analysis. 

6.4.1  Using  Logistic  Models  to  Test  Conditional  Independence 

For a binary response $Y$, we analyze the effect of a binary predictor $X$, conditional on
the category of a qualitative covariate $Z$. Let $\pi_{ik} = P(Y = 1 \mid X = i, Z = k)$. Consider the
model

\[
\operatorname{logit}(\pi_{ik}) = \alpha + \beta x_i + \beta_k^Z, \quad i = 1, 2, \quad k = 1, \ldots, K, \tag{6.4}
\]

where $x_1 = 1$ and $x_2 = 0$. This model assumes that the $XY$ conditional odds ratio is the same
at each category of $Z$, namely, $\exp(\beta)$. The null hypothesis of $XY$ conditional independence
is $H_0$: $\beta = 0$. The Wald statistic is $(\hat\beta/SE)^2$. The likelihood-ratio statistic is the difference
between deviance statistics for the reduced model

\[
\operatorname{logit}(\pi_{ik}) = \alpha + \beta_k^Z \tag{6.5}
\]

and the full model. These tests are sensible when $X$ has a similar effect at each category of
$Z$. They have df = 1.

Alternatively, since the reduced model (6.5) is equivalent to conditional independence
of $X$ and $Y$, we can test conditional independence using a goodness-of-fit test of that
model. Such a test has df = $K$ when $X$ is binary. This corresponds to comparing model
(6.5) and the saturated model, which permits $\beta \neq 0$ in (6.4) and also contains $(K - 1)$ $XZ$
interaction parameters. The likelihood-ratio test statistic partitions into two components, the
likelihood-ratio statistic with df = 1 for testing $H_0$: $\beta = 0$ in model (6.4) and the
likelihood-ratio statistic with df = $(K - 1)$ for testing the fit of model (6.4) and thus equality of the $K$
odds ratios (Goodman 1969, Cheng et al. 2010).
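The partition just described can be written out explicitly. The following is a sketch in which $G^2(M)$ denotes the deviance (likelihood-ratio goodness-of-fit statistic) of model $M$; this notation is introduced here only for illustration.

```latex
\[
\underbrace{G^2[(6.5)]}_{\text{df} = K}
  \;=\; \underbrace{\bigl\{G^2[(6.5)] - G^2[(6.4)]\bigr\}}_{\text{test of } H_0\colon \beta = 0,\ \text{df} = 1}
  \;+\; \underbrace{G^2[(6.4)]}_{\text{fit of (6.4)},\ \text{df} = K - 1}
\]
```

The degrees of freedom add in the same way as the statistics, $K = 1 + (K - 1)$.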

When no interaction exists or when the conditional $XY$ association has relatively little
variation among the levels of $Z$, it follows from results in Section 5.3.7 that the approach
using df = $K$ of testing conditional independence is less powerful, especially when $K$ is
large. When model (6.4) holds, both tests have the same noncentrality. Thus, the test of
$\beta = 0$ in model (6.4) is more powerful, since it has fewer degrees of freedom. However,
when the direction of the conditional $XY$ association varies among categories of $Z$, the
df = 1 test can be less powerful.

6.4.2  Cochran-Mantel-Haenszel  Test  of  Conditional  Independence 

Mantel and Haenszel (1959) proposed a non-model-based test of $H_0$: conditional independence
in $2 \times 2 \times K$ tables. Focusing on retrospective studies of disease, they treated
response (column) marginal totals as fixed. Thus, in each partial table $k$ of cell counts
$\{n_{ijk}\}$, their analysis conditioned on both the treatment (e.g., group) totals $\{n_{1+k}, n_{2+k}\}$
and the response outcome totals $\{n_{+1k}, n_{+2k}\}$. The usual sampling schemes then yield a
hypergeometric distribution (3.17) for the first cell count $n_{11k}$ in each partial table. That
count determines $\{n_{12k}, n_{21k}, n_{22k}\}$, given the marginal totals.

Under $H_0$, the hypergeometric mean and variance of $n_{11k}$ are

\[
\mu_{11k} = E(n_{11k}) = \frac{n_{1+k}\, n_{+1k}}{n_{++k}}, \qquad
\operatorname{var}(n_{11k}) = \frac{n_{1+k}\, n_{2+k}\, n_{+1k}\, n_{+2k}}{n_{++k}^2\,(n_{++k} - 1)}.
\]

Cell counts from different partial tables are independent. The test statistic combines information
from the $K$ tables by comparing $\sum_k n_{11k}$ to its null expected value. It equals

\[
\text{CMH} = \frac{\left[\sum_k (n_{11k} - \mu_{11k})\right]^2}{\sum_k \operatorname{var}(n_{11k})}. \tag{6.6}
\]

This statistic has a large-sample chi-squared null distribution with df = 1.
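As a minimal sketch of how (6.6) is computed, the following Python code evaluates the CMH statistic for the eight centers of Table 6.9. The function name `cmh_statistic` and the tuple layout are choices made here for illustration, and the P-value uses the identity that a chi-squared tail probability with df = 1 equals a two-sided standard normal tail.

```python
import math

# Table 6.9, one tuple per center:
# (drug success, drug failure, control success, control failure)
TABLES = [
    (11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12), (2, 14, 1, 16),
    (6, 11, 0, 12), (1, 10, 0, 10), (1, 4, 1, 8), (4, 2, 6, 1),
]

def cmh_statistic(tables):
    """Return (CMH, P-value) for 2x2 tables given as (n11, n12, n21, n22)."""
    diff_sum = 0.0  # running sum of n11k - mu11k
    var_sum = 0.0   # running sum of hypergeometric variances
    for n11, n12, n21, n22 in tables:
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        n = row1 + row2
        if n < 2:
            continue  # the variance formula needs n > 1
        diff_sum += n11 - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    stat = diff_sum ** 2 / var_sum
    # Chi-squared (df = 1) upper tail: P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

cmh, p = cmh_statistic(TABLES)
print(f"CMH = {cmh:.2f}, P = {p:.3f}")
```

Strata in which a row or column total is zero add nothing to either sum, so they do not affect the result.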

When the odds ratio $\theta_{XY(k)} > 1$ in partial table $k$, we expect that $(n_{11k} - \mu_{11k}) > 0$. When
$\theta_{XY(k)} > 1$ in every partial table or $\theta_{XY(k)} < 1$ in each table, $\sum_k (n_{11k} - \mu_{11k})$ tends to be
relatively large in absolute value. This test works best when the conditional $XY$ association
is similar in each partial table. In this sense it is similar to the tests of $H_0$: $\beta = 0$ in logistic
model (6.4). When the sample sizes in the strata are moderately large, this test usually gives
similar results. In fact, it is a score test of $H_0$: $\beta = 0$ in that model (Birch 1964b, 1965,
Darroch 1981, Day and Byar 1979).

Cochran (1954) proposed a similar test statistic. He treated the rows in each $2 \times 2$ table
as two independent binomials rather than a hypergeometric. Cochran's statistic is (6.6) with
$\operatorname{var}(n_{11k})$ replaced by

\[
\operatorname{var}(n_{11k}) = \frac{n_{1+k}\, n_{2+k}\, n_{+1k}\, n_{+2k}}{n_{++k}^3}.
\]

Because of the similarity in their approaches, we call (6.6) the Cochran-Mantel-Haenszel
(CMH) statistic. The Mantel and Haenszel approach using the hypergeometric is more
general in that it also applies to some cases in which the rows are not independent binomial
samples from two populations. Examples are (1) retrospective studies and (2) randomized
clinical trials with the available subjects (usually volunteers) randomly allocated to two
treatments. In the first case the column totals are naturally fixed. In the second, under the
null hypothesis the column margins are the same regardless of how subjects are assigned to
treatments, and randomization arguments lead to the hypergeometric for each $2 \times 2$ table.

Mantel and Haenszel (1959) proposed (6.6) but with a continuity correction. The
P-value from the test then better approximates the P-value of an exact conditional test, based directly
on the convolution of the hypergeometric distributions rather than the chi-squared approximation
(Section 7.3.5). However, that test tends to be conservative. Mantel and Fleiss
(1980) stated that the asymptotic approximation for this test is adequate if the potential
values for $\sum_k (n_{11k} - \mu_{11k})$, for the fixed margins in each $2 \times 2$ table, can exceed ±5. The
CMH statistic generalizes for $I \times J \times K$ tables (Section 8.4.3).

6.4.3  Example:  Multicenter  Clinical  Trial  Revisited 

For the multicenter clinical trial introduced at the beginning of Section 6.4, Table 6.9 reports
the sample odds ratio for each table and the expected value and variance of the number of
successes for the drug treatment ($n_{11k}$) under $H_0$: conditional independence. In each table
except the last, the sample odds ratio shows a positive association. Thus, it makes sense to
combine results using CMH = 6.38, with df = 1. There is considerable evidence against
$H_0$ ($P = 0.012$).

Similar results occur in testing $H_0$: $\beta = 0$ in logistic model (6.4). The model fit has
$\hat\beta = 0.777$ with SE = 0.307. The Wald statistic is $(0.777/0.307)^2 = 6.42$ ($P = 0.011$).
The likelihood-ratio statistic equals 6.67 ($P = 0.010$).
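The logistic-model results quoted here can be reproduced approximately without statistical software. The sketch below is one possible approach and not necessarily how the reported fit was obtained: for a trial value of $\beta$ it solves each center's likelihood equation for that center's intercept by bisection, then solves the remaining likelihood equation for $\beta$. The helper names (`bisect`, `center_intercept`, `score_beta`) are hypothetical. The standard error comes from the information for $\beta$ after eliminating the center intercepts.

```python
import math

# Table 6.9, one tuple per center: (s, m, t, n) =
# (drug successes, drug total, control successes, control total)
CENTERS = [
    (11, 36, 10, 37), (16, 20, 22, 32), (14, 19, 7, 19), (2, 16, 1, 17),
    (6, 17, 0, 12), (1, 11, 0, 10), (1, 5, 1, 9), (4, 6, 6, 7),
]

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def bisect(f, lo, hi, iters=100):
    """Root of a monotone function f that changes sign on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def center_intercept(beta, s, m, t, n):
    """Intercept (alpha + beta_k^Z) solving center k's likelihood equation."""
    return bisect(lambda a: m * expit(a + beta) + n * expit(a) - (s + t),
                  -30.0, 30.0)

def score_beta(beta):
    """Likelihood equation for beta, with center intercepts profiled out."""
    return sum(s - m * expit(center_intercept(beta, s, m, t, n) + beta)
               for s, m, t, n in CENTERS)

def log_lik(beta):
    """Profile log likelihood (binomial kernel) at a given beta."""
    total = 0.0
    for s, m, t, n in CENTERS:
        a = center_intercept(beta, s, m, t, n)
        p1, p0 = expit(a + beta), expit(a)
        total += (s * math.log(p1) + (m - s) * math.log(1 - p1)
                  + t * math.log(p0) + (n - t) * math.log(1 - p0))
    return total

beta_hat = bisect(score_beta, -5.0, 5.0, iters=60)

# Information for beta after eliminating the intercepts:
# sum over centers of v1 * v0 / (v1 + v0), with v = (count) * pi * (1 - pi).
info = 0.0
for s, m, t, n in CENTERS:
    a = center_intercept(beta_hat, s, m, t, n)
    p1, p0 = expit(a + beta_hat), expit(a)
    v1, v0 = m * p1 * (1 - p1), n * p0 * (1 - p0)
    info += v1 * v0 / (v1 + v0)
se = 1.0 / math.sqrt(info)

wald = (beta_hat / se) ** 2
lr = 2.0 * (log_lik(beta_hat) - log_lik(0.0))  # beta = 0 gives model (6.5)
print(f"beta_hat = {beta_hat:.3f}, SE = {se:.3f}, Wald = {wald:.2f}, LR = {lr:.2f}")
```

Setting `beta = 0` in `log_lik` yields the fit of the reduced model (6.5), so the last difference is the likelihood-ratio statistic described in Section 6.4.1.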

6.4.4  CMH  Test  Is  Advantageous  for  Sparse  Data 

In summary, for the main-effects logistic model (6.4), the CMH statistic is the score statistic
alternative to the likelihood-ratio or Wald test of $H_0$: $\beta = 0$. As $n \to \infty$ with fixed $K$, all
three tests have the same asymptotic chi-squared behavior under $H_0$. An advantage of
the CMH statistic is that its chi-squared limit also applies with an alternative asymptotic
scheme in which $K \to \infty$ as $n \to \infty$. The asymptotic theory for likelihood-ratio and Wald
tests requires the number of parameters (and hence $K$) to be fixed, so it does not apply to
this scheme.

Here is an application of this type: Suppose each stratum has a single matched pair of
subjects, one in each group. Then, $n_{1+k} = n_{2+k} = 1$ for each $k$ and $n = 2K$, so $K \to \infty$
as $n \to \infty$. Table 6.10 shows the data layout for this situation. When both subjects in
stratum $k$ make the same response, as in the first case in Table 6.10, $n_{+1k} = 0$ or $n_{+2k} = 0$.
Given the marginal counts, the internal counts are then completely determined, and
$\mu_{11k} = n_{11k}$ and $\operatorname{var}(n_{11k}) = 0$. When the subjects make differing responses, as in the second
case, $n_{+1k} = n_{+2k} = 1$, so that $\mu_{11k} = 0.50$ and $\operatorname{var}(n_{11k}) = 0.25$. Thus, a matched pair
contributes to the CMH statistic only when the two subjects' responses differ. Let $K^*$
denote the number of the $K$ tables that satisfy this. Although each $n_{11k}$ can take only two
values, the central limit theorem implies that $\sum_k n_{11k}$ is approximately normal for large
$K^*$. Then, the distribution of CMH is approximately chi-squared.

Table 6.10  Two Examples of a Stratum Containing a Matched Pair

                        Response             Response
Element of Pair     Success  Failure     Success  Failure
First                  1        0           1        0
Second                 1        0           0        1
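To make the matched-pairs argument concrete, the following sketch applies (6.6) to one stratum per pair and checks that only the discordant pairs matter. The pair counts are hypothetical and chosen only for illustration. The closed form $(b - c)^2/(b + c)$, with $b$ and $c$ the two discordant counts, is the form usually associated with McNemar's test.

```python
def cmh_statistic(tables):
    """CMH statistic (6.6) for 2x2 tables given as (n11, n12, n21, n22)."""
    diff_sum = 0.0
    var_sum = 0.0
    for n11, n12, n21, n22 in tables:
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        n = row1 + row2
        diff_sum += n11 - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return diff_sum ** 2 / var_sum

# Hypothetical matched-pairs data (not from the text): number of pairs of each type.
both_success, both_failure = 40, 25
first_only, second_only = 15, 5  # discordant pairs

strata = (
    [(1, 0, 1, 0)] * both_success    # first case of Table 6.10
    + [(0, 1, 0, 1)] * both_failure  # both fail: also contributes nothing
    + [(1, 0, 0, 1)] * first_only    # second case of Table 6.10
    + [(0, 1, 1, 0)] * second_only
)

cmh = cmh_statistic(strata)
discordant_form = (first_only - second_only) ** 2 / (first_only + second_only)
print(cmh, discordant_form)
```

Each discordant pair contributes $\pm 0.5$ to the numerator sum and 0.25 to the denominator, which is why the two expressions agree.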

Usually, when $K$ grows with $n$, each stratum has few observations, so the full table is
sparse. A stratum may have more than two observations, as in case-control studies that match
several controls with each case. The nonstandard setting in which $K \to \infty$ as $n \to \infty$
is called sparse-data asymptotics. Ordinary ML estimation then breaks down because the
number of parameters is not fixed, instead having the same order as the sample size. In
particular, the chi-squared approximation is good for the likelihood-ratio and Wald statistics
for testing conditional independence only when $K$ is fixed and small relative to $n$ and the strata
marginal totals mostly exceed about 5 to 10.


6.4.5  Estimation  of  Common  Odds  Ratio 

It is more informative to estimate the strength of association than to test hypotheses about
it. When the association seems stable among partial tables, we can combine the $K$ sample
odds ratios into a summary measure of conditional association. The logistic model (6.4)
implies homogeneous association, $\theta_{XY(1)} = \cdots = \theta_{XY(K)} = \exp(\beta)$. The ML estimate of
the common odds ratio is $\exp(\hat\beta)$.

Other estimators of a common odds ratio are not model-based. Woolf (1955) proposed
an exponentiated weighted average of the $K$ sample log odds ratios. Let $p_{ij|k} = n_{ijk}/n_{++k}$.
Mantel and Haenszel (1959) proposed

\[
\hat\theta_{MH} = \frac{\sum_k (n_{11k}\, n_{22k}/n_{++k})}{\sum_k (n_{12k}\, n_{21k}/n_{++k})}
= \frac{\sum_k n_{++k}\, p_{11|k}\, p_{22|k}}{\sum_k n_{++k}\, p_{12|k}\, p_{21|k}}. \tag{6.7}
\]

This gives more weight to strata with larger sample sizes. With fixed $K$, $\log(\hat\theta_{MH})$ is slightly
less efficient than the ML estimator $\hat\beta$ unless $\beta = 0$ (Tarone et al. 1983). However, it is
preferred over the ML estimator when $K$ is large and the data are very sparse. The ML
estimator $\hat\beta$ of the log odds ratio then tends to be too large in absolute value. For sparse-data
asymptotics with only a single matched pair in each stratum, for instance, $\hat\beta \xrightarrow{p} 2\beta$. (See
Exercise 11.29.)

Robins et al. (1986) derived an estimated variance for $\log(\hat\theta_{MH})$ that applies both for
standard asymptotics with large $n$ and fixed $K$ and for sparse-data asymptotics in which $K$ is
also large. Expressing $\hat\theta_{MH} = R/S = (\sum_k R_k)/(\sum_k S_k)$ with $R_k = n_{11k}\, n_{22k}/n_{++k}$
and $S_k = n_{12k}\, n_{21k}/n_{++k}$, their
derivation showed that $(\log\hat\theta_{MH} - \log\theta)$ is approximately proportional to $(R - \theta S)$. They
also showed that $E(R - \theta S) = 0$ and derived the variance of $(R - \theta S)$. Their result is

\[
\hat\sigma^2[\log\hat\theta_{MH}] =
\frac{1}{2R^2}\sum_k n_{++k}^{-1}(n_{11k} + n_{22k})R_k
+ \frac{1}{2S^2}\sum_k n_{++k}^{-1}(n_{12k} + n_{21k})S_k
+ \frac{1}{2RS}\sum_k n_{++k}^{-1}\left[(n_{11k} + n_{22k})S_k + (n_{12k} + n_{21k})R_k\right].
\]

For the eight-center clinical trial summarized by Table 6.9,

\[
\hat\theta_{MH} = \frac{\sum_k (n_{11k}\, n_{22k}/n_{++k})}{\sum_k (n_{12k}\, n_{21k}/n_{++k})}
= \frac{(11 \times 27)/73 + \cdots + (4 \times 1)/13}{(25 \times 10)/73 + \cdots + (2 \times 6)/13} = 2.13.
\]

For $\log\hat\theta_{MH} = 0.758$, $\hat\sigma[\log\hat\theta_{MH}] = 0.303$. A 95% confidence interval for the common
odds ratio is $\exp(0.758 \pm 1.96 \times 0.303)$ or (1.18, 3.87). Similar results occur using model
(6.4). The 95% confidence interval for $\exp(\beta)$ is $\exp(0.777 \pm 1.96 \times 0.307)$, or (1.19,
3.97), using the Wald interval, and (1.20, 4.02) using the likelihood-ratio interval. Although
the evidence of an effect is considerable, inference about its size is rather imprecise with
such a small sample. The odds of success may be as little as 20% higher with the drug, or
they may be as much as four times as high.
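The Mantel-Haenszel estimate and its interval for Table 6.9 can be checked with a few lines of code. This is a sketch under the assumption that the variance is the three-term Robins et al. (1986) form for $\log\hat\theta_{MH}$. The function name `mantel_haenszel_or` and its return layout are hypothetical.

```python
import math

# Table 6.9 cell counts per center: (n11, n12, n21, n22) =
# (drug success, drug failure, control success, control failure)
TABLES = [
    (11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12), (2, 14, 1, 16),
    (6, 11, 0, 12), (1, 10, 0, 10), (1, 4, 1, 8), (4, 2, 6, 1),
]

def mantel_haenszel_or(tables, z=1.96):
    """Return (theta_MH, SE of log theta_MH, confidence interval for theta)."""
    R = S = 0.0
    sum_pr = sum_qs = sum_cross = 0.0
    for n11, n12, n21, n22 in tables:
        n = n11 + n12 + n21 + n22
        r_k = n11 * n22 / n
        s_k = n12 * n21 / n
        p_k = (n11 + n22) / n  # proportion of counts on the main diagonal
        q_k = (n12 + n21) / n  # proportion of counts off the diagonal
        R += r_k
        S += s_k
        sum_pr += p_k * r_k
        sum_qs += q_k * s_k
        sum_cross += p_k * s_k + q_k * r_k
    theta = R / S
    var_log = (sum_pr / (2 * R ** 2) + sum_qs / (2 * S ** 2)
               + sum_cross / (2 * R * S))
    se_log = math.sqrt(var_log)
    log_theta = math.log(theta)
    ci = (math.exp(log_theta - z * se_log), math.exp(log_theta + z * se_log))
    return theta, se_log, ci

theta_mh, se_log, ci = mantel_haenszel_or(TABLES)
print(f"theta_MH = {theta_mh:.2f}, SE(log) = {se_log:.3f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Centers 5 and 6, where the control group has no successes, contribute to $R$ but add zero to $S$, so the estimate stays finite even though their sample odds ratios are infinite.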

If the true odds ratios are not identical but do not vary much, $\hat\theta_{MH}$ still is a useful summary
of the conditional associations. Similarly, the CMH test is a powerful summary of evidence
against $H_0$: conditional independence, as long as the sample associations fall primarily in
a single direction. It is not necessary to assume equality of odds ratios to use the CMH test
or $\hat\theta_{MH}$.


6.4.6  Meta-analyses  for  Summarizing  Multiple  2x2  Tables 

A meta-analysis is a statistical analysis that combines information from several studies.
For comparing two treatments on a binary response, the analysis refers to a $2 \times 2 \times K$
table, one $2 \times 2$ table for each study. For a particular effect measure, such as the odds
ratio or a difference of proportions, here we consider the simplifying assumption that the
population values of the measure are identical in each study. This is usually unrealistic, but
is often adequate for providing a simple summary of the effect when the true effect does not
vary much among studies. Sections 6.4.10 and 13.3.6 generalize to allow for heterogeneity
among the effects.

Consider first the significance test of the null hypothesis of no effect, that is, conditional
independence between the treatment and the response for each study. The logistic model
(6.4) is a natural one for such an analysis. We test $H_0$: $\beta = 0$ using the likelihood-ratio
test or the Cochran-Mantel-Haenszel (CMH) test (6.6). As mentioned in Section 6.4.4, the
CMH test is advantageous for highly sparse data. When asymptotics are unsuitable even for
that test, we can use a small-sample generalization of Fisher's exact test to multiple $2 \times 2$
tables, as presented in Section 7.3.5. For the CMH test or for the small-sample test, tables
for which there are either no successes or no failures provide no information about whether
there is truly an association and make no contribution to the test. (Recall that Section 6.4.4
discussed this for matched pairs.) There is no reason to use some device such as adding
a small constant to cells of the table so those tables enter the analysis, because they are
uninformative about the odds ratio (Agresti and Hartzel 2000).

Consider next summarizing the size of the effect. For the logistic model (6.4), we can use
the ML estimate of the odds ratio $\exp(\hat\beta)$ and a corresponding confidence interval. For highly
sparse data, we can instead use the Mantel-Haenszel estimate $\hat\theta_{MH}$ and its corresponding
interval. A small-sample interval can guarantee a lower bound for the coverage probability
(Section 16.6.6). For all such frequentist analyses, tables for which there are either no
successes or no failures provide no information about the size of the common odds ratio
and do not contribute to the estimate.

6.4.7  Meta-analyses for Multiple 2 × 2 Tables: Difference of Proportions

The difference of proportions and the relative risk are alternative effect measures that are
simpler to interpret than the odds ratio. A common difference of proportions for each study
is the parameter $\delta$ in a model

\[
\pi_{ik} = \alpha + \delta x_i + \beta_k^Z, \quad i = 1, 2, \quad k = 1, \ldots, K,
\]

that replaces the logit link in model (6.4) by the identity link.

Mantel-Haenszel-type estimates are also available for such measures. In stratum $k$,
denote the binomial “success” counts by $s_k = n_{11k}$ and $t_k = n_{21k}$ based on sample sizes
$m_k = n_{1+k}$ and $n_k = n_{2+k}$, and let $N_k = m_k + n_k$. With $w_k = m_k n_k/N_k$, the estimator of a
common difference of proportions is the weighted average of the stratum-specific estimates
$\hat\delta_k = [(s_k/m_k) - (t_k/n_k)]$,

\[
\hat\delta_{MH} = \frac{\sum_k w_k \hat\delta_k}{\sum_k w_k}
\]

(Greenland and Robins 1985). An estimated variance,

\[
\hat\sigma^2(\hat\delta_{MH}) = \frac{\hat\delta_{MH}\sum_k P_k + \sum_k Q_k}{\left(\sum_k w_k\right)^2},
\]

with

\[
P_k = [m_k^2 t_k - n_k^2 s_k + m_k n_k (n_k - m_k)/2]/N_k^2, \qquad
Q_k = [s_k(n_k - t_k) + t_k(m_k - s_k)]/2N_k, \tag{6.8}
\]

applies under both standard and sparse-data asymptotics (Sato 1989).

Under standard asymptotics, the ML model-based estimator is preferred because it is
more efficient. However, ML fitting difficulties often arise when both probabilities are
near 0 or near 1, and the $\{\pi_{ik}\}$ must be constrained to fall between 0 and 1. Here is
an alternative approach that is then asymptotically efficient and does not have boundary
problems: Express the score or profile likelihood $100(1 - \alpha)\%$ confidence interval for the
difference of proportions (see Section 3.2.5) for study $k$ alone as $d_k \pm z_{\alpha/2} s_k$, where $d_k$ is
the midpoint of that interval (i.e., not the sample difference of proportions $\hat\delta_k$) and $s_k$ is a
“pseudo standard error” that is taken to be the width of the interval divided by $2z_{\alpha/2}$. Then,
taking weight $w_k = [1/s_k^2]/[\sum_a 1/s_a^2]$, we form $\hat\delta = \sum_k w_k d_k$, $SE = [\sum_k (1/s_k^2)]^{-1/2}$,
and the summary interval $\hat\delta \pm z_{\alpha/2}(SE)$. Unlike Wald methods, this does not require using
unreliable sample standard errors from each study but merely uses a midpoint and width
based on information obtained from the likelihood function.
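The combining step in this alternative approach is simple arithmetic once the study-specific intervals are available. The sketch below assumes those intervals have already been computed elsewhere (it does not construct score or profile likelihood intervals), and the three intervals shown are hypothetical numbers used only to exercise the formulas. The function name `combine_intervals` is likewise an illustrative choice.

```python
import math

def combine_intervals(intervals, z=1.96):
    """Combine study-specific intervals (lower, upper), all formed with critical value z.

    Returns (estimate, SE, summary interval) using weights proportional to
    1 / s_k^2, where s_k = (interval width) / (2 z) is the pseudo standard error.
    """
    midpoints = [(lo + hi) / 2 for lo, hi in intervals]
    pseudo_se = [(hi - lo) / (2 * z) for lo, hi in intervals]
    precisions = [1 / s ** 2 for s in pseudo_se]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    estimate = sum(w * d for w, d in zip(weights, midpoints))
    se = 1 / math.sqrt(total)
    return estimate, se, (estimate - z * se, estimate + z * se)

# Hypothetical 95% intervals for three studies (illustrative numbers only).
intervals = [(-0.10, 0.30), (0.00, 0.20), (-0.05, 0.45)]
estimate, se, summary = combine_intervals(intervals)
print(f"estimate = {estimate:.3f}, SE = {se:.3f}, "
      f"interval = ({summary[0]:.3f}, {summary[1]:.3f})")
```

Narrower intervals receive larger weights, and with a single study the summary interval reproduces that study's own interval.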

To illustrate, the eight-center clinical trial data of Table 6.9 were analyzed in Sections 6.4.3
and 6.4.5 with CMH methods and with logistic model (6.4). For summarizing the effect by a
common difference of success proportions between drug and control, the Mantel-Haenszel-type
estimate (6.8) is $\hat\delta_{MH} = 0.130$ with SE = 0.050. Using the alternative method just mentioned
that combines information from the eight center-specific score confidence intervals,
we get $\hat\delta = 0.128$, SE = 0.049, and a 95% confidence interval for a common difference of
proportions of (0.032, 0.224).

For the difference of proportions, tables for which there are either no successes or no
failures provide no information about whether the true common value $\delta$ is nonzero (i.e.,
the significance testing problem) but they do give information about the magnitude of the
effect. If each treatment, for example, has a very large number of failures and no successes,
then we have evidence that both population proportions are close to 0 and that the difference
is small. Thus, such data do have an impact on practical significance. (See Exercise 6.33
for an illustration.)
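Both points can be checked numerically. The sketch below computes the Mantel-Haenszel-type estimate of a common difference of proportions and the Sato (1989) standard error for Table 6.9, then adds one hypothetical center with no successes in either arm (100 subjects per arm, a made-up stratum) to show that such a table pulls the estimated difference toward 0. The function name `mh_risk_difference` is an illustrative choice.

```python
import math

# Table 6.9, one tuple per center: (s, m, t, n) =
# (drug successes, drug total, control successes, control total)
CENTERS = [
    (11, 36, 10, 37), (16, 20, 22, 32), (14, 19, 7, 19), (2, 16, 1, 17),
    (6, 17, 0, 12), (1, 11, 0, 10), (1, 5, 1, 9), (4, 6, 6, 7),
]

def mh_risk_difference(strata):
    """Return (delta_MH, SE) for strata given as (s, m, t, n)."""
    sum_w = sum_wd = sum_p = sum_q = 0.0
    for s, m, t, n in strata:
        N = m + n
        w = m * n / N
        sum_w += w
        sum_wd += w * (s / m - t / n)
        sum_p += (m ** 2 * t - n ** 2 * s + m * n * (n - m) / 2) / N ** 2
        sum_q += (s * (n - t) + t * (m - s)) / (2 * N)
    delta = sum_wd / sum_w
    se = math.sqrt(delta * sum_p + sum_q) / sum_w
    return delta, se

delta, se = mh_risk_difference(CENTERS)
print(f"delta_MH = {delta:.3f}, SE = {se:.3f}")

# Hypothetical extra center: no successes at all, 100 subjects per arm.
delta_extra, se_extra = mh_risk_difference(CENTERS + [(0, 100, 0, 100)])
print(f"with the all-failure center: delta_MH = {delta_extra:.3f}")
```

The extra center has $\hat\delta_k = 0$ but a large weight $w_k = 50$, so it changes the estimate of $\delta$ even though it would add nothing to the CMH statistic or to $\hat\theta_{MH}$.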

Agresti  and  Hartzel  (2000)  discussed  ways  of  summarizing  information  from  multiple 
tables  and  gave  many  additional  references.  Tian  et  al.  (2009)  proposed  an  alternative 
approach  designed  for  small-sample  cases  in  which  some  centers  may  have  no  outcomes 
of  a  particular  type. 

6.4.8  Collapsibility  and  Logistic  Models  for  Contingency  Tables 

We  have  seen  that  conditional  associations  in  partial  tables  usually  differ  from  marginal 
associations.  Under  certain  collapsibility  conditions  given  in  Section  2.3.6,  however,  they 
are  the  same.  For  odds  ratios,  recall  that  for  three-way  tables,  XY  marginal  and  conditional 
odds  ratios  are  identical  if  either  Z  and  X  are  conditionally  independent  or  if  Z  and  Y  are 
conditionally  independent. 

For instance, suppose that a clinical trial studies the association between a binary
treatment variable $X$ ($x_1 = 1$, $x_2 = 0$) and a binary response $Y$, using data from $K$ centers
($Z$). The logistic model (6.4), namely,

\[
\operatorname{logit}(\pi_{ik}) = \alpha + \beta x_i + \beta_k^Z, \quad i = 1, 2, \quad k = 1, \ldots, K, \tag{6.9}
\]

has the same treatment effect $\beta$ for each center. Since the model has no restriction on the
conditional association of $Z$ with $X$ or with $Y$, this effect may differ after collapsing the
$2 \times 2 \times K$ table over centers. The estimated $XY$ conditional odds ratio, $\exp(\hat\beta)$, typically
differs from the sample odds ratio in the marginal $2 \times 2$ table.

Next, consider the simpler model that lacks center effects, $\operatorname{logit}(\pi_{ik}) = \alpha + \beta x_i$. This
states that, for each treatment, the success probability is identical for each center. The
model satisfies a collapsibility condition for the $XY$ association, because it states that $Z$
is conditionally independent of $Y$, given $X$. So, when center effects are negligible and
the simpler model fits nearly as well, the estimated treatment effect is approximately the
marginal $XY$ odds ratio.
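For the eight-center data of Table 6.9, center effects are not negligible, and the gap between marginal and conditional association is easy to see. The sketch below collapses the table over centers and compares the marginal sample odds ratio with the Mantel-Haenszel summary of the conditional odds ratios. The variable names are illustrative.

```python
# Table 6.9 cell counts per center: (n11, n12, n21, n22) =
# (drug success, drug failure, control success, control failure)
TABLES = [
    (11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12), (2, 14, 1, 16),
    (6, 11, 0, 12), (1, 10, 0, 10), (1, 4, 1, 8), (4, 2, 6, 1),
]

# Collapse the 2 x 2 x K table over centers to get the marginal 2 x 2 table.
m11, m12, m21, m22 = (sum(table[i] for table in TABLES) for i in range(4))
marginal_or = (m11 * m22) / (m12 * m21)

# Mantel-Haenszel estimate (6.7) of the common conditional odds ratio.
numerator = sum(n11 * n22 / (n11 + n12 + n21 + n22) for n11, n12, n21, n22 in TABLES)
denominator = sum(n12 * n21 / (n11 + n12 + n21 + n22) for n11, n12, n21, n22 in TABLES)
conditional_or = numerator / denominator

print(f"marginal table: {(m11, m12, m21, m22)}")
print(f"marginal OR = {marginal_or:.2f}, conditional (MH) OR = {conditional_or:.2f}")
```

Here the marginal odds ratio is noticeably smaller than the conditional summary, consistent with success rates that vary considerably across centers.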

6.4.9  Testing  Homogeneity  of  Odds  Ratios 

The homogeneous association condition $\theta_{XY(1)} = \cdots = \theta_{XY(K)}$ for $2 \times 2 \times K$ tables is
equivalent to logistic model (6.9). A test of homogeneous association is implicitly a
goodness-of-fit test of this model. The usual $G^2$ and $X^2$ test statistics provide this, with
df = $K - 1$. They test that the $K - 1$ parameters in the saturated model that are the coefficients
of interaction terms [cross-products of the indicator variable for $X$ with $(K - 1)$
indicator variables for categories of $Z$] all equal 0.

For the eight-center clinical trial data in Table 6.9, $G^2 = 9.75$ and $X^2 = 8.03$ (df = 7)
do not contradict the hypothesis of equal odds ratios. It is reasonable to summarize the
conditional association by a single odds ratio (e.g., $\hat\theta_{MH} = 2.13$ or $\exp(\hat\beta) = 2.17$) for all eight
partial tables.

6.4.10  Summarizing Heterogeneity in Odds Ratios

In  practice,  the  effect  of  interest  is  often  similar  from  stratum  to  stratum.  In  multicenter 
clinical  trials  comparing  a  new  drug  to  a  standard,  for  example,  if  the  new  drug  is  truly 
more  beneficial,  the  population  effect  is  usually  positive  in  each  stratum. 

In strict terms, however, a model with homogeneous effects is unrealistic. Consider the
odds ratio, to illustrate. First, we rarely expect the true odds ratio to be exactly the same
in each stratum, because of unmeasured covariates that affect it. Breslow (1976) discussed
modeling of the log odds ratio using a set of explanatory variables. Second, the model
regards the strata effects $\{\beta_k^Z\}$ as fixed effects, treating them as the only strata of interest.
Often the strata are merely a sampling of the possible ones. Multicenter clinical trials have
data for certain centers but many other centers could have been chosen. Scientists would
like their conclusions to apply to all such centers, not only those in the study.

A somewhat different logistic model treats the true log odds ratios in the partial tables as
a random sample from a $N(\mu, \sigma^2)$ distribution. Fitting the model yields an estimated mean
log odds ratio and an estimated variability about that mean. The inference applies to the
population of strata rather than only those sampled. This type of model uses random effects
in the linear predictor to induce this extra type of variability. In Chapter 13, we discuss
GLMs with random effects, and in Section 13.3.5 we fit such a model to Table 6.9.


6.4.11  Propensity  Scores  in  Observational  Studies 

We finish this section by mentioning a more challenging setting for analyzing conditional
associations: observational studies in which we want to compare two groups while controlling
for possibly confounding variables $\mathbf{x}$. Rosenbaum and Rubin (1983) proposed
methods of adjusting for bias in making such comparisons. They defined the propensity
as the probability of being in one group, for a given setting of the explanatory variables
$\mathbf{x}$. They used logistic regression to estimate how propensity depends on $\mathbf{x}$. In comparing
the groups on the response variable, they showed how to control for differing distributions
of the groups on $\mathbf{x}$ by adjusting for the estimated propensity. This is done by using the
propensity to match samples from the groups or to subclassify subjects into several strata
consisting of intervals of propensity scores or to adjust directly by entering the propensity
in the model.

For  any  study  that  is  observational  rather  than  randomized,  there  is  still  the  limitation 
that  propensity  score  methods  adjust  only  for  observed  confounding  covariates  and  not  for 
unobserved  ones.  Also,  the  methods  work  better  in  larger  samples,  so  observed  covariates 
tend  to  be  more  truly  balanced  in  the  subclassifications.  In  various  writings,  Rubin  has 
pointed  out  that  confidence  in  causal  conclusions  based  on  such  methods  must  rely  on 
how  consistent  the  results  are  with  other  evidence  and  how  sensitive  the  conclusions  are  to 
reasonable  deviations  such  as  in  the  effects  of  unobserved  covariates. 


6.5  DETECTING  AND  DEALING  WITH  INFINITE  ESTIMATES 

The  log-likelihood  function  for  logistic  regression  models  is  strictly  concave.  ML  estimates 
exist  and  are  unique  except  in  certain  boundary  cases.  Estimates  do  not  exist  or  may  be 
infinite  when  there  is  no  overlap  in  the  sets  of  explanatory  variable  values  having  y  =  0 
and having y = 1 (Albert and Anderson 1984).

6.5.1  Complete or Quasi-complete Separation

The space of explanatory variable values is said to have complete separation when a
hyperplane can pass through that space such that on one side of that hyperplane $y_i = 0$ for
all observations, whereas on the other side, $y_i = 1$ always. This means that there exists a
vector $\mathbf{b}$ such that

\[
\mathbf{b}^T \mathbf{x}_i > 0 \ \text{whenever}\ y_i = 1, \qquad
\mathbf{b}^T \mathbf{x}_i < 0 \ \text{whenever}\ y_i = 0.
\]

There is then perfect discrimination, as we can predict the sample outcomes perfectly by
knowing the predictor values.

Figure 6.5 illustrates for a single explanatory variable. Here, $y = 0$ at $x = 10, 20, 30, 40$,
and $y = 1$ at $x = 60, 70, 80, 90$. For $\mathbf{x}_i = (1, x_i)^T$, the predictor $\mathbf{b}^T \mathbf{x}_i = -50 + x_i$ [i.e.,
$\mathbf{b}^T = (-50, 1)$] gives perfect predictions. An ideal fit has $\hat\pi = 0$ for $x < 50$ and $\hat\pi = 1$
for $x > 50$. By letting $\beta \to \infty$ and, for fixed $\beta$, letting $\alpha = -\beta(50)$ so that $\hat\pi = 0.50$ at
$x = 50$, we can generate a sequence with ever-increasing value of the likelihood function
that comes successively closer to a perfect fit.
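The sequence of fits described for Figure 6.5 can be traced numerically. In this sketch the grid of $\beta$ values is an arbitrary choice for illustration. For each $\beta$ the intercept is set to $\alpha = -50\beta$ and the log likelihood is evaluated with a numerically stable log-sigmoid.

```python
import math

# Data of Figure 6.5: y = 0 at x = 10, ..., 40 and y = 1 at x = 60, ..., 90.
x = [10, 20, 30, 40, 60, 70, 80, 90]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(alpha, beta):
    """Bernoulli log likelihood for logit(pi) = alpha + beta * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = alpha + beta * xi
        # log(pi) when y = 1; log(1 - pi) = log_sigmoid(-eta) when y = 0
        total += log_sigmoid(eta) if yi == 1 else log_sigmoid(-eta)
    return total

betas = [0.0, 0.01, 0.1, 1.0, 10.0]  # arbitrary increasing grid
values = [log_likelihood(-50.0 * b, b) for b in betas]
for b, v in zip(betas, values):
    print(f"beta = {b:6.2f}  log likelihood = {v:.6g}")
```

The log likelihood rises from $8\log(0.5)$ at $\beta = 0$ toward its supremum of 0 but never attains it at a finite $\beta$, which is why no finite ML estimate exists.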

In practice, most software fails to recognize when some ML estimates are actually
infinite. After a few cycles of iterative fitting, the log likelihood looks flat at the working
estimate, and convergence criteria are satisfied. Because the log likelihood is so flat and
because the variance of $\hat{\boldsymbol\beta}$ comes from the negative inverse of the matrix of second
derivatives, software typically reports huge standard errors. For the data in Figure 6.5, for
instance, PROC GENMOD in SAS reports $\operatorname{logit}(\hat\pi) = -192.2 + 3.8x$ with standard errors
of $8.0 \times 10^8$ and $1.5 \times 10^7$.

In practice, an indication of complete separation is when the fitted prediction equation
perfectly predicts the response outcome for the entire data, giving $\hat\pi = 1.0$ (to many
decimal places) whenever $y = 1$ and $\hat\pi = 0.0$ whenever $y = 0$. A related indication is that
the reported maximized log-likelihood value is 0 to many decimal places. Another warning
signal is when standard errors seem unnaturally large. When there is indication of complete
separation for a model containing several predictors, using the forward selection algorithm
can reveal a subset of them for which complete separation occurs once they are all used.

A weaker condition that causes at least one estimate to be infinite, called quasi-complete
separation, occurs when a hyperplane separates explanatory variable values with $y = 1$
and with $y = 0$, but cases exist with both outcomes on that hyperplane. For example, this
happens if we add to Figure 6.5 two observations at $x = 50$, one with $y = 1$ and one
with $y = 0$. With quasi-complete separation, there is not perfect discrimination for all
observations. The maximized log likelihood is then strictly less than 0. An indication of
quasi-complete separation is that some observations have $\hat\pi = 1.0$ or 0.0. Again, a warning
signal is when reported standard errors seem unnaturally large.

Figure 6.5  Perfect discrimination resulting in an infinite logistic regression parameter estimate.

When  complete  or  quasi-complete  separation  do  not  occur,  all  ML  estimates  are  finite  and 
unique.  Quasi-complete  separation  is  more  common  than  complete  separation.  It  is  more 
liable  to  happen  with  qualitative  predictors  than  quantitative  predictors.  If  any  category  of 
a  qualitative  predictor  has  either  no  cases  with  y  =  0  or  no  cases  with  y  —  1 ,  there  is  quasi- 
complete  separation  when  that  variable  is  entered  as  a  factor  in  the  model  (i.e.,  using  an 
indicator  variable  for  that  category).  With  many  predictors,  it’s  a  good  idea  to  cross-classify 
each  qualitative  predictor  with  y  to  check  for  an  empty  cell,  which  is  a  sufficient  condition 
for  quasi-complete  separation. 
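
The cross-classification check just described can be sketched in a few lines of Python. This is only an illustration: the function name `empty_cells` and the small data set are hypothetical, not taken from the text.

```python
from collections import Counter

def empty_cells(levels, y):
    """Flag categories of a qualitative predictor in which only one outcome occurs.

    Returns {category: (count with y = 1, count with y = 0)} for each flagged category.
    """
    counts = Counter(zip(levels, y))
    flagged = {}
    for level in sorted(set(levels)):
        n1 = counts[(level, 1)]
        n0 = counts[(level, 0)]
        if n1 == 0 or n0 == 0:   # an empty cell in the predictor-by-y table
            flagged[level] = (n1, n0)
    return flagged

# Hypothetical data: category "b" has no successes and "c" has no failures.
levels = ["a", "a", "b", "b", "c"]
y = [0, 1, 0, 0, 1]
print(empty_cells(levels, y))
```

Any category returned by such a check signals an infinite ML estimate if the predictor is entered as a factor.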

With  an  infinite  estimate,  Wald  inference  is  worthless.  By  contrast,  we  can  still  compute 
likelihood-ratio  and  score  tests  and  invert  them  to  get  a  confidence  interval.  For  example, 
the  likelihood  still  has  a  maximized  value  at  the  infinite  estimate  for  a  parameter,  so  we 
can  compare  its  value  to  the  value  when  the  parameter  is  equated  to  some  fixed  value  such 
as zero. For the data in Figure 6.5, the likelihood-ratio test statistic for $H_0$: $\beta = 0$ is 11.09
(df = 1, P = 0.001), and the 95% confidence interval for $\beta$ is $(0.06, \infty)$, so we can conclude
that  the  effect  is  positive  in  the  population. 

6.5.2  Example:  Multicenter  Clinical  Trial  with  Few  Successes 

Table 6.11 shows results of a clinical trial conducted at five centers. The purpose was to
compare an active drug to placebo for treating fungal infections, with a binary (success,
failure) response. For these data, let Y = response, X = treatment (1 = active drug,
0  =  placebo),  and  Z  =  center. 


Table 6.11 Clinical Trial Relating Treatment to Response, Showing Also XY and YZ
Marginal Tables

                                Response (Y)          YZ Marginal
Center (Z)     Treatment (X)    Success   Failure     Success   Failure
1              Active drug         0         5
               Placebo             0         9           0        14
2              Active drug         1        12
               Placebo             0        10           1        22
3              Active drug         0         7
               Placebo             0         5           0        12
4              Active drug         6         3
               Placebo             2         6           8         9
5              Active drug         5         9
               Placebo             2        12           7        21
XY marginal    Active drug        12        36
               Placebo             4        42

Source: Data courtesy of Diane Connell, Sandoz Pharmaceuticals Corporation.

Centers 1 and 3 had no successes. Thus, the 5 × 2 YZ marginal table relating response to
center collapsed over treatment, shown on the right side of Table 6.11, contains zero counts.
Infinite ML estimates occur for terms in logistic models relating to the YZ association. An
example is the model


\[
\operatorname{logit}(\pi_{ik}) = \beta x_i + \beta_k^Z .
\]

[We take out the intercept from (6.9), so the $\{\beta_k^Z\}$ need no constraint; then, these refer to
each center’s effect rather than contrasts between each center and a baseline center.] The
likelihood function increases continually as $\beta_1^Z$ and $\beta_3^Z$ decrease toward $-\infty$; that is, as the
logit decreases toward $-\infty$, so the fitted probability of success decreases toward the ML
estimate of 0 for those centers.

Because of the infinite estimates, we cannot conduct a Wald test of the center effects in
Table 6.11. However, SAS (PROC GENMOD) reports a maximized log-likelihood value
of -28.87 for this model and -40.58 when the center term is removed from the model, so
the likelihood-ratio statistic for this effect equals 23.42 (df = 4).
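
The arithmetic behind this likelihood-ratio statistic can be checked with a short Python sketch. The two log-likelihood values are the ones quoted above; the function names are illustrative, and the closed-form tail probability used here holds only for df = 4, where P(X > x) = exp(-x/2)(1 + x/2).

```python
import math

def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood-ratio statistic: twice the gain in maximized log likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_sf_df4(x):
    """Right-tail probability of a chi-squared variable with df = 4 (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

stat = lr_statistic(-28.87, -40.58)   # log likelihoods reported in the text
p_value = chi2_sf_df4(stat)
print(f"LR statistic = {stat:.2f}, P = {p_value:.5f}")
```

The test remains usable even though two of the center estimates are infinite, because it depends only on the maximized log-likelihood values.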

The counts in the 2 × 2 marginal table relating response to treatment, shown in the
bottom panel of Table 6.11, are all positive. The empty cells affect the center estimates, but
not the treatment estimate, for this model. In the limit as the log likelihood increases, the
fitted values have a log odds ratio $\hat{\beta} = 1.55$ (SE = 0.70). Most software reports this but,
instead of $\hat{\beta}_1^Z = \hat{\beta}_3^Z = -\infty$, reports large numbers with extremely large standard errors.
For instance, PROC GENMOD in SAS reports values of about -26 for $\hat{\beta}_1^Z$ and $\hat{\beta}_3^Z$, with
standard errors of about 200,000.

The treatment estimate $\hat{\beta} = 1.55$ also results when we delete centers 1 and 3 from the
analysis. When a center contains responses of only one type, it provides no information
about this odds ratio. (It does provide information about the size of some other measures,
such as the difference of proportions, as discussed above in Section 6.4.6.) Such
tables also make no contribution to standard tests of conditional independence, such as the
Cochran-Mantel-Haenszel test.
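
The claim that such tables contribute nothing to the Cochran-Mantel-Haenszel test can be verified numerically. The sketch below uses the counts of Table 6.11 and the usual form of the statistic without a continuity correction (an assumption about which variant is intended); the function names are illustrative.

```python
# Table 6.11 counts per center:
# (active success, active failure, placebo success, placebo failure)
centers = {
    1: (0, 5, 0, 9),
    2: (1, 12, 0, 10),
    3: (0, 7, 0, 5),
    4: (6, 3, 2, 6),
    5: (5, 9, 2, 12),
}

def cmh_terms(n11, n12, n21, n22):
    """Observed count, null mean, and null variance of n11 for one 2x2 table."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    mean = row1 * col1 / n
    var = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return n11, mean, var

def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel statistic (no continuity correction)."""
    terms = [cmh_terms(*t) for t in tables]
    diff = sum(obs - mean for obs, mean, _ in terms)
    return diff ** 2 / sum(var for _, _, var in terms)

cmh_all = cmh_statistic(centers.values())
cmh_kept = cmh_statistic(centers[k] for k in (2, 4, 5))
print(cmh_all, cmh_kept)   # the two values agree
```

For centers 1 and 3 the observed count, its null mean, and its null variance are all zero, so dropping those centers leaves the statistic unchanged.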

An  alternative  strategy  in  multicenter  analyses  combines  centers  of  a  similar  type.  Then, 
if  each  resulting  partial  table  has  responses  with  both  outcomes,  the  inferences  use  all 
data.  For  Table  6.11,  perhaps  centers  1  and  3  are  similar  to  center  2,  since  the  success  rate 
is  very  low  for  that  center.  Combining  these  three  centers  and  refitting  the  model  to  this 
table and the tables for the other two centers yields $\hat{\beta} = 1.56$ (SE = 0.70). Usually, this
strategy  produces  results  essentially  the  same  as  from  deleting  tables  with  no  outcomes  of 
a  particular  type. 

6.5.3  Remedies  When  at  Least  One  ML  Estimate  Is  Infinite 

What  can  you  do  if  there  is  complete  or  quasi-complete  separation  and  thus  at  least  one  ML 
estimate  does  not  exist?  As  just  mentioned,  you  can  still  usually  do  inference  about  that 
effect. For example, you can conduct a likelihood-ratio test. If $\hat{\beta} = \infty$, a profile likelihood
confidence interval will have the form $(L, \infty)$. With quasi-complete separation, some
parameter  estimates  may  be  unaffected,  and  their  inference  will  resemble  the  usual.  With 
small  samples  and  categorical  predictors,  you  can  use  the  specialized  exact  conditional 
methods  to  be  presented  in  Section  7.3. 

Alternatively,  you  can  make  some  adjustment  so  all  estimates  are  finite.  For  example,  if 
a category of a qualitative predictor has no cases with y = 1, perhaps combine that category
with a similar one such that outcomes of both types then occur. Some approaches smooth
the data, thus producing finite estimates. The Bayesian approach (Section 7.2) is the best
known way of doing that. The amount of smoothing for the resulting estimates depends
strongly on the variability in the Bayes prior distribution.

A  related  way  maximizes  a  penalized  likelihood  function.  This  adds  a  term  to  the  ordinary 
log-likelihood  function  such  that  maximizing  the  amended  function  smooths  the  estimates 
by  shrinking  them  toward  0  (Firth  1993a).  Section  7.4.5  introduces  this  approach,  which 
corresponds  to  using  the  Bayesian  posterior  mode  induced  by  the  Jeffreys  prior  distribution. 
For the data in Figure 6.5, this method replaces the infinite estimate of $\beta$ by $\hat{\beta} = 0.067$
(SE = 0.042). The corresponding 95% penalized profile likelihood confidence interval
is (0.013, 0.334). Its highly asymmetric form about $\hat{\beta}$ reflects the highly nonsymmetric
appearance of the log-likelihood function for such data.

6.6  SAMPLE  SIZE  AND  POWER  CONSIDERATIONS 

In  any  statistical  procedure,  the  sample  size  n  influences  the  results.  Strong  effects  are  likely 
to  be  detected  even  when  n  is  small.  By  contrast,  detection  of  weak  effects  requires  large  n. 
A  study  design  should  reflect  the  sample  size  needed  to  provide  good  power  for  detecting 
the  effect. 

6.6.1  Sample  Size  and  Power  for  Comparing  Two  Proportions 

For test statistics having large-sample normal distributions, power calculations can use
ordinary methods. To illustrate, consider a test comparing binomial parameters $\pi_1$ and $\pi_2$
for two medical treatments. An experiment plans independent samples of size $n_i = n/2$
receiving each treatment. The researchers expect $\pi_i \approx 0.60$ for each, and a difference
of at least 0.10 is important. In testing $H_0$: $\pi_1 = \pi_2$, the variance of $\hat{\pi}_1 - \hat{\pi}_2$ is
$\pi_1(1 - \pi_1)/(n/2) + \pi_2(1 - \pi_2)/(n/2) \approx 0.60 \times 0.40 \times (4/n) = 0.96/n$. In particular,

\[
z = \frac{(\hat{\pi}_1 - \hat{\pi}_2) - (\pi_1 - \pi_2)}{\sqrt{0.96/n}}
\]

has approximately a standard normal distribution for $\pi_1$ and $\pi_2$ near 0.60.
The power of an $\alpha$-level test of $H_0$ is approximately

\[
P\left[ \frac{|\hat{\pi}_1 - \hat{\pi}_2|}{\sqrt{0.96/n}} > z_{\alpha/2} \right].
\]

When $\pi_1 - \pi_2 = 0.10$, for $\alpha = 0.05$, this equals

\[
\begin{aligned}
& P\left[ \frac{(\hat{\pi}_1 - \hat{\pi}_2) - 0.10}{\sqrt{0.96/n}} > 1.96 - 0.10\sqrt{n/0.96} \right]
 + P\left[ \frac{(\hat{\pi}_1 - \hat{\pi}_2) - 0.10}{\sqrt{0.96/n}} < -1.96 - 0.10\sqrt{n/0.96} \right] \\
&\quad = P[z > 1.96 - 0.10\sqrt{n/0.96}] + P[z < -1.96 - 0.10\sqrt{n/0.96}] \\
&\quad = 1 - \Phi[1.96 - 0.10\sqrt{n/0.96}] + \Phi[-1.96 - 0.10\sqrt{n/0.96}],
\end{aligned}
\]

where $\Phi$ is the standard normal cdf. The power is approximately 0.11 when n = 50 and 0.30
when n = 200. It is not easy to attain significance when effects are small and the sample
size is not very large. Figure 6.6 shows how the power increases in n when $\pi_1 - \pi_2 = 0.10$.
By contrast, it also shows how the power improves when $\pi_1 - \pi_2 = 0.20$.

Figure 6.6 Approximate power for testing equality of proportions, with true values near middle of range and
$\alpha = 0.05$.
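
The power expression above is easy to evaluate directly. The following Python sketch uses only the standard library; the function name and its default arguments are illustrative, and the variance approximation 4π(1 − π)/n with π = 0.60 is the one used in the text.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(n, diff=0.10, pi=0.60, alpha=0.05):
    """Approximate power of the two-sided test of pi1 = pi2 with n/2 subjects per group.

    Uses the approximation var(pi1_hat - pi2_hat) = 4 * pi * (1 - pi) / n.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = diff / sqrt(4 * pi * (1 - pi) / n)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (50, 200):
    print(n, round(power_two_proportions(n), 2))
```

Calling the function over a grid of n values for `diff=0.10` and `diff=0.20` gives curves of the kind summarized in Figure 6.6.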

For specified P(type I error) = $\alpha$ and P(type II error) = $\beta$ (and hence power = $1 - \beta$),
we can determine the sample size needed to attain those values. A study using $n_1 = n_2$
requires approximately

\[
n_1 = n_2 = (z_{\alpha/2} + z_{\beta})^2 [\pi_1(1 - \pi_1) + \pi_2(1 - \pi_2)] / (\pi_1 - \pi_2)^2 .
\]

For a test with $\alpha = 0.05$ and $\beta = 0.10$ when $\pi_1$ and $\pi_2$ are truly about 0.60 and 0.70,
$n_1 = n_2 = 473$. Similarly, with about 473 subjects in each group, a 95% confidence interval
has only a 0.10 chance of containing 0 when actually $\pi_1 = 0.60$ and $\pi_2 = 0.70$.
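
As a check on this calculation, the formula can be coded as follows. This is a minimal sketch; the function name is illustrative, and rounding up to the next integer is a convention assumed here rather than stated in the text.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(pi1, pi2, alpha=0.05, beta=0.10):
    """Approximate sample size per group for a two-sided test comparing two proportions."""
    z = NormalDist().inv_cdf
    z_alpha2 = z(1 - alpha / 2)   # z_{alpha/2}
    z_beta = z(1 - beta)          # z_{beta}
    variance_sum = pi1 * (1 - pi1) + pi2 * (1 - pi2)
    return (z_alpha2 + z_beta) ** 2 * variance_sum / (pi1 - pi2) ** 2

print(ceil(n_per_group(0.60, 0.70)))
```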

This  sample-size  formula  is  approximate  and  may  underestimate  slightly  the  actual 
values  required.  It  is  adequate  for  most  practical  work,  though,  in  which  only  rough 
conjectures are available for $\pi_1$ and $\pi_2$. Farrington and Manning (1990) and Fleiss et al.
(2003, Chap. 4) showed more precise formulas.


6.6.2  Sample  Size  Determination  in  Logistic  Regression 

Consider now the model $\operatorname{logit}[\pi(x_i)] = \alpha + \gamma x_i$, $i = 1, \ldots, n$, in which x is quantitative.
[We use $\gamma$ so as not to confuse with $\beta$ = P(type II error).] The sample size needed to
achieve a certain power for testing $H_0$: $\gamma = 0$ depends on the variance of $\hat{\gamma}$. This depends
on $\{\pi(x_i)\}$, and formulas for n use a guess for $\bar{\pi} = \pi(\bar{x})$ and the distribution of X. The
effect size is the log odds ratio $\tau$ comparing $\pi(\bar{x})$ to $\pi(\bar{x} + s_x)$, the probability at a standard
deviation above the mean of x. For a one-sided test when X is approximately normal, Hsieh
(1989) derived

\[
n = [z_{\alpha} + z_{\beta} \exp(-\tau^2/4)]^2 (1 + 2\bar{\pi}\delta)/(\bar{\pi}\tau^2),
\tag{6.10}
\]

where

\[
\delta = [1 + (1 + \tau^2)\exp(5\tau^2/4)]/[1 + \exp(-\tau^2/4)].
\]

The value n decreases as $\bar{\pi} \to 0.50$ and as $|\tau|$ increases.

We illustrate for modeling the effect of x = cholesterol level on the probability of severe
heart disease for a population for which that probability at an average level of cholesterol is
about 0.08. Researchers want the test to be sensitive to a 50% increase in this probability,
for a standard deviation increase in cholesterol. The odds of severe heart disease at the
mean cholesterol level equal 0.08/0.92 = 0.087, and the odds one standard deviation
above the mean equal 0.12/0.88 = 0.136. The odds ratio equals 0.136/0.087 = 1.57, and
$\tau = \log(1.57) = 0.450$. For $\alpha = 0.05$ and $\beta = 0.10$, $\delta = 1.306$ and n = 612.
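
The cholesterol illustration can be reproduced with a few lines of Python. This is a sketch of formula (6.10) as written above; the function name is illustrative, and small differences from the quoted values can arise from rounding the odds ratio to 1.57 before taking logs.

```python
from math import exp, log
from statistics import NormalDist

def hsieh_sample_size(pi_bar, tau, alpha=0.05, beta=0.10):
    """Sample size from (6.10) for a one-sided test; returns (n, delta)."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(1 - beta)
    t2 = tau ** 2
    delta = (1 + (1 + t2) * exp(5 * t2 / 4)) / (1 + exp(-t2 / 4))
    n = (z_alpha + z_beta * exp(-t2 / 4)) ** 2 * (1 + 2 * pi_bar * delta) / (pi_bar * t2)
    return n, delta

odds_mean = 0.08 / 0.92       # odds at the mean cholesterol level
odds_plus_sd = 0.12 / 0.88    # odds one standard deviation above the mean
tau = log(odds_plus_sd / odds_mean)
n, delta = hsieh_sample_size(0.08, tau)
print(round(tau, 3), round(delta, 3), round(n))
```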

6.6.3  Sample  Size  in  Multiple  Logistic  Regression 

A  multiple  logistic  regression  model  requires  larger  n  to  detect  effects.  Let  R  denote  the 
multiple  correlation  between  the  predictor  X  of  interest  and  the  others  in  the  model.  The 
formula (6.10) for n divides by $(1 - R^2)$. In that formula, $\bar{\pi}$ is evaluated at the mean of all
the  explanatory  variables,  and  the  odds  ratio  refers  to  the  effect  of  X  at  the  mean  level  of 
the  other  predictors. 

Consider the example in Section 6.6.2 when blood pressure is also a predictor. If
the correlation between cholesterol and blood pressure is 0.40, we need
$n \approx 612/[1 - (0.40)^2] = 729$.
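
The adjustment amounts to a single division, sketched here in Python with an illustrative function name; rounding up is an assumed convention.

```python
from math import ceil

def adjust_for_correlation(n_simple, r):
    """Inflate a one-predictor sample size by the factor 1 / (1 - R^2)."""
    return n_simple / (1 - r ** 2)

print(ceil(adjust_for_correlation(612, 0.40)))
```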

These  formulas  provide,  at  best,  very  approximate  indications  of  sample  size.  Most 
applications have only a crude guess for $\bar{\pi}$ and R, and X may be far from normally
distributed. 


6.6.4  Power  for  Chi-Squared  Tests  in  Contingency  Tables 

When hypotheses are false, squared normal and $X^2$ and $G^2$ statistics have large-sample
noncentral chi-squared distributions (Section 5.3.8). Suppose that $H_0$ is equivalent to model
M for a contingency table. Let $\pi_i$ denote the true probability in cell i, and let $\pi_i(M)$ denote the
value to which the ML estimate $\hat{\pi}_i$ for model M converges, where $\sum_i \pi_i = \sum_i \pi_i(M) = 1$.
For a multinomial sample of size n, the noncentrality parameter for $X^2$ equals

\[
\lambda = n \sum_i \frac{[\pi_i - \pi_i(M)]^2}{\pi_i(M)} .
\tag{6.11}
\]

This has the same form as $X^2$, with $\pi_i$ in place of the sample proportion $p_i$ and $\pi_i(M)$ in
place of $\hat{\pi}_i$. The noncentrality parameter for $G^2$ equals

\[
\lambda = 2n \sum_i \pi_i \log \frac{\pi_i}{\pi_i(M)} .
\tag{6.12}
\]

When $H_0$ is true, all $\pi_i = \pi_i(M)$. Then, for either statistic, $\lambda = 0$ and the ordinary (central)
chi-squared distribution applies.
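
To make (6.11) and (6.12) concrete, the following Python sketch evaluates both noncentrality parameters for a hypothetical 2 × 2 table under the independence model. The cell probabilities and n = 100 are invented for illustration; for independence, the model values π_i(M) are the products of the true row and column margins.

```python
from math import log

def noncentrality_pearson(pi, pi_m, n):
    """Noncentrality (6.11) for the Pearson statistic."""
    return n * sum((p - q) ** 2 / q for p, q in zip(pi, pi_m))

def noncentrality_lr(pi, pi_m, n):
    """Noncentrality (6.12) for the likelihood-ratio statistic."""
    return 2 * n * sum(p * log(p / q) for p, q in zip(pi, pi_m) if p > 0)

# Hypothetical true cell probabilities for a 2x2 table, listed row by row.
pi = [0.30, 0.20, 0.20, 0.30]
row = [pi[0] + pi[1], pi[2] + pi[3]]
col = [pi[0] + pi[2], pi[1] + pi[3]]
# Independence-model values pi_i(M): products of the true margins.
pi_m = [row[i] * col[j] for i in range(2) for j in range(2)]

lam_pearson = noncentrality_pearson(pi, pi_m, n=100)
lam_lr = noncentrality_lr(pi, pi_m, n=100)
print(lam_pearson, lam_lr)
```

The two values are close but not identical, and both are proportional to n.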

To determine the approximate power for a chi-squared test with df = $\nu$, (1) choose a
hypothetical set of true values $\{\pi_i\}$, (2) calculate $\{\pi_i(M)\}$ by fitting to $\{\pi_i\}$ the model M
for $H_0$, (3) calculate the noncentrality parameter $\lambda$, and (4) calculate $P[X^2_{\nu,\lambda} > \chi^2_{\nu}(\alpha)]$.
Table 6.12 shows an excerpt from a table of noncentral chi-squared probabilities for step 4
with $\alpha = 0.05$.

Table 6.12 Power of Chi-Squared Test for $\alpha = 0.05$

                                              Noncentrality
df     0.0    0.2    0.4    0.6    0.8    1.0    2.0    3.0    4.0    5.0    7.0   10.0   15.0   25.0
 1    .050   .073   .097   .121   .146   .170   .293   .410   .516   .609   .754   .885   .972   .998
 2    .050   .065   .081   .098   .115   .133   .226   .322   .415   .504   .655   .815   .944   .996
 3    .050   .062   .075   .088   .102   .116   .192   .275   .358   .440   .590   .761   .917   .993
 4    .050   .060   .071   .082   .093   .106   .172   .244   .320   .396   .540   .716   .891   .989
 6    .050   .058   .066   .075   .084   .094   .146   .206   .270   .336   .468   .644   .843   .980
 8    .050   .057   .064   .071   .079   .087   .131   .182   .238   .296   .417   .588   .799   .968
10    .050   .056   .062   .068   .075   .082   .121   .166   .215   .268   .379   .542   .760   .956
20    .050   .053   .056   .060   .063   .066   .096   .125   .158   .193   .273   .402   .611   .883
50    .050   .052   .054   .056   .059   .061   .076   .092   .110   .129   .173   .250   .398   .687

Source: Reprinted with permission from G. E. Haynam, Z. Govindarajulu, and F. C. Leone, in Selected Tables in
Mathematical Statistics, eds. H. L. Harter and D. B. Owen. Chicago: Markham, 1970.
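
Step 4 can also be carried out numerically rather than by table lookup. The sketch below uses only the Python standard library and relies on the representation of the noncentral chi-squared distribution as a Poisson mixture of central chi-squared distributions; the function names, the series used for the central cdf, and the bisection search for the critical value are implementation choices made here, not methods described in the text.

```python
import math

def chi2_cdf(x, df):
    """Central chi-squared cdf via the power series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total:
        term *= h / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(h) - h - math.lgamma(a))

def chi2_critical(df, alpha=0.05):
    """Upper-alpha critical value of the central chi-squared, found by bisection."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def chi2_power(df, noncentrality, alpha=0.05, max_terms=200):
    """P[noncentral chi-squared > critical value], as a Poisson mixture of central cdfs."""
    crit = chi2_critical(df, alpha)
    mean = noncentrality / 2.0
    weight = math.exp(-mean)          # Poisson probability of j = 0
    cdf = 0.0
    for j in range(max_terms):
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        weight *= mean / (j + 1)      # Poisson probability of j + 1
    return 1.0 - cdf

print(round(chi2_power(1, 1.0), 3), round(chi2_power(2, 2.0), 3))
```

The printed values can be compared with the corresponding entries of Table 6.12 (df = 1 with noncentrality 1.0, and df = 2 with noncentrality 2.0).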


6.6.5  Power  for  Testing  Conditional  Independence 

We  use  an  example  based  on  one  in  O’Brien  (1986).  A  standard  fetal  heart  rate  monitoring 
test  predicts  whether  a  fetus  will  require  nonroutine  care  following  delivery.  The  standard 
test  has  categories  (worrisome,  reassuring).  The  response  Y  is  whether  the  newborn  required 
some nonroutine medical care during the first week after birth (1 = yes, 0 = no). A new
fetal  heart  rate  monitoring  test  is  developed,  having  categories  (very  worrisome,  somewhat 
worrisome,  reassuring).  A  physician  plans  to  study  whether  this  new  test  can  help  make 
predictions  about  the  outcome;  that  is,  given  the  result  of  the  standard  test,  is  there  an 
association  between  the  response  and  the  result  of  the  new  test?  A  relevant  statistic  tests 
the effect of the new monitoring test in the logistic model having the new test (N) and the
standard  test  (S)  as  qualitative  predictors. 

To help select n, a statistician asks the physician to conjecture about the joint distribution
of the explanatory variables, with questions such as “What proportion of the cases do
you think will be scored ‘reassuring’ by both tests?” For each NS combination, the physician
also guessed P(Y = 1). Table 6.13 shows one scenario for marginal and conditional
probabilities. These yield a joint distribution $\{\pi_{ijk}\}$ from their product, such as 0.04 ×
0.40 = 0.016 for the proportion of cases judged worrisome by the standard test and very
worrisome by the new test and requiring nonroutine medical care. These joint probabilities
yield fitted probabilities $\pi(M_0)$ and $\pi(M_1)$ for the null and alternative logit models. (We
can get these by entering $\{\pi_{ijk}\}$ in percentage form as counts in software for logistic regression,
fitting the relevant model, and dividing the fitted counts by 100 to get the fitted joint
probabilities.) The likelihood-ratio test comparing these models has noncentrality (6.12)
with $\pi(M_1)$ playing the role of $\pi$ and $\pi(M_0)$ playing the role of $\pi(M)$.

Table 6.13 Scenario for Power Computation

Standard Test    New Test               Joint Probability    P(nonroutine care)
Worrisome        Very worrisome               0.04                 0.40
                 Somewhat worrisome           0.08                 0.32
                 Reassuring                   0.04                 0.27
Reassuring       Very worrisome               0.02                 0.30
                 Somewhat worrisome           0.18                 0.22
                 Reassuring                   0.64                 0.15

Source: Reprinted with permission from O’Brien (1986).

For the scenario in Table 6.13, the noncentrality equals 0.00816n, with df = 2. For
n  =  400,  600,  and  1000,  the  approximate  powers  when  a  =  0.05  are  0.35,  0.49,  and  0.73. 
This  scenario  predicts  64%  of  the  observations  to  occur  at  only  one  combination  of  the 
factors.  The  lack  of  dispersion  for  the  factors  weakens  the  power. 
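
These powers can be recomputed from the stated noncentrality. The Python sketch below takes the value 0.00816n from the text as given (it does not refit the logit models) and exploits the fact that, for even df, the central chi-squared tail probability has a closed form; the function name is illustrative.

```python
from math import exp, log

def power_df2(noncentrality, alpha=0.05, max_terms=200):
    """Power of a chi-squared test with df = 2, via a Poisson mixture of closed-form tails."""
    half_crit = -log(alpha)            # critical value / 2, since P(chi2_2 > x) = exp(-x/2)
    mean = noncentrality / 2.0
    weight = exp(-mean)                # Poisson(mean) probability of j = 0
    term = tail = alpha                # P(central chi2 with df = 2 > critical value)
    power = 0.0
    for j in range(max_terms):
        power += weight * tail
        weight *= mean / (j + 1)       # Poisson probability of j + 1
        term *= half_crit / (j + 1)    # next term of exp(-h) * h^k / k!
        tail += term                   # tail probability for df = 2 + 2(j + 1)
    return power

for n in (400, 600, 1000):
    print(n, round(power_df2(0.00816 * n), 2))
```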


6.6.6  Effects  of  Sample  Size  on  Model  Selection  and  Inference 

The  effects  of  sample  size  suggest  some  cautions  for  model  selection.  For  small  n,  the  most 
parsimonious  model  accepted  in  a  goodness-of-fit  test  may  be  quite  simple.  By  contrast, 
larger  samples  usually  require  more  complex  models  to  pass  goodness-of-fit  tests.  Then, 
some  effects  that  are  statistically  significant  may  be  weak  and  substantively  unimportant. 
With  large  n  it  may  be  adequate  to  use  a  model  that  is  simpler  than  models  that  pass 
goodness-of-fit  tests.  An  analysis  that  focuses  solely  on  goodness-of-fit  tests  is  incomplete. 
It  is  also  necessary  to  estimate  model  parameters  and  describe  strengths  of  effects. 

These remarks merely reflect limitations of significance testing. In many areas of application,
null hypotheses are rarely true. With large enough n, they will be rejected. A
more relevant concern is whether the difference between true parameter values and null
hypothesis values is sufficient to be important. Many methodologists overemphasize testing
and underutilize estimation methods such as confidence intervals. When the P-value is
small, a confidence interval specifies the extent to which $H_0$ may be false, thus helping us
determine whether rejecting it has practical importance. When the P-value is not small, the
confidence interval indicates whether some plausible parameter values are far from $H_0$. A
wide confidence interval containing the $H_0$ value indicates that the test had weak power at
important  alternatives. 


NOTES 

Section  6.1:  Strategies  in  Model  Selection 

6.1  AIC,  BIC:  For  cogent  arguments  supporting  the  use  of  AIC,  see  Burnham  and  Anderson  (2010). 
A  modified  version  is  recommended  if  the  number  of  parameters  is  large.  Some  statisticians 
believe  that  BIC  can  select  an  overly  simple  model.  For  this  and  other  critiques,  see  articles  by 
Gelman  and  Rubin,  Firth  and  Kuha,  Raftery,  Weakliem,  and  Xie,  in  the  February  1999  issue  of 
Sociological  Methods  and  Research. 


Section  6.2:  Logistic  Regression  Diagnostics 

6.2  Diagnostics:  Olive  and  Hawkins  (2005)  presented  graphics  that  are  useful  for  variable  selection. 
As  an  alternative  to  the  residual  methods  discussed,  smoothing  the  residuals  before  plotting 
them (e.g., using methods to be presented in Section 7.4) can be helpful (Fowlkes 1987,
Lloyd 1999, Sec. 5.4). Cook and Weisberg (1999, Chap. 22) and Landwehr et al. (1984)
showed other examples of useful diagnostic plots. For other logistic regression diagnostics, see
Copas (1988), who also considered resistant fitting methods (e.g., to take misclassification into
account), Hosmer and Lemeshow (2000, Chap. 5), Johnson (1985), and Pregibon (1981).


Section  6.3:  Summarizing  the  Predictive  Power  of  a  Model 

6.3 $R^2$ measures: Amemiya (1981), Efron (1978), Hu et al. (2005), Liao and McGee (2003),
Maddala (1983), Schemper (2003), and Zheng and Agresti (2000) and references therein
reviewed $R^2$ measures for binary regression. Hosmer and Lemeshow (2000, Sec. 5.2.3) discussed
classification  tables  and  their  limitations.  Pepe  (2004)  and  references  therein  surveyed  ROC 
methodology. 


Section  6.4:  Mantel-Haenszel  and  Related  Methods  for  Multiple  2x2  Tables 

6.4  DIF:  One  application  of  CMH  methods  is  differential  item  functioning:  comparing  groups  in 
terms  of  how  different  they  are  in  responding  to  items  on  a  questionnaire,  after  adjusting  for 
overall  abilities  or  scores.  See  Holland  and  Wainer  (1993). 

6.5  Breslow-Day  test:  An  analog  of  $mh  and  <$mh  summarizes  relative  risks  from  several  strata 
(Greenland and Robins 1985). Breslow and Day (1980, p. 142) proposed an alternative
large-sample test of homogeneity of odds ratios. In each partial table let $\{\hat{\mu}_{ijk}\}$ have the same
marginals as the data observed, yet have odds ratio equal to $\hat{\theta}_{MH}$. Their test statistic has the
Pearson form comparing $\{n_{ijk}\}$ to $\{\hat{\mu}_{ijk}\}$. Tarone (1985) showed that, because of the inefficiency
of $\hat{\theta}_{MH}$, the Breslow-Day statistic must be adjusted for it to have exactly a limiting chi-squared
null distribution with df = K - 1. This adjustment is usually minor. Other work on comparing
odds ratios and estimating a common value includes Breslow and Day (1980, Sec. 4.4), Donner
and  Hauck  (1986),  Gart  (1970),  Jones  et  al.  (1989),  and  Liang  and  Self  (1985).  For  modeling 
the  odds  ratio,  see  Breslow  (1976),  Breslow  and  Day  (1980,  Sec.  7.5),  and  Prentice  (1976a). 
Breslow  emphasized  retrospective  studies,  in  which  the  conditional  approach  is  natural  since 
the  outcome  totals  are  fixed. 


Section  6.5:  Detecting  and  Dealing  with  Infinite  Estimates 

6.6  Infinite  ML:  For  discussion  of  this  topic,  including  other  link  functions  and  GLMs,  see  Albert 
and  Anderson  (1984),  Haberman  (1974a),  Santner  and  Duffy  (1986),  Silvapulle  (1981),  and 
Wedderburn (1976).

6.7  High  imbalance:  King  and  Zeng  (2001)  and  Owen  (2007)  discussed  applications  in  which 
one  outcome  category  is  much  more  common  than  the  other.  Examples  include  rare  diseases, 
fraudulent  use  of  a  credit  card,  and  non-spam  email  messages  in  spam  folders.  King  and  Zeng 
proposed  a  sampling  design  of  sampling  all  possible  cases  of  the  rare  outcome  and  a  much 
smaller  fraction  of  the  other  outcome.  Owen  showed  that  under  a  sampling  scheme  for  which 
$n \to \infty$ while the number of outcomes in one category remains finite, a limit exists for the
estimated  parameter  vector  that  depends  on  the  distribution  of  the  x  values. 


Section  6.6:  Sample  Size  and  Power  Considerations 

6.8  Noncentral  chi-squared:  Gail  and  Gart  (1973)  and  Suissa  and  Shuster  (1985)  studied  sample 
size  for  obtaining  fixed  power  in  Fisher’s  test.  Farrington  and  Manning  (1990)  considered 
sample size for nonnull effects for the difference of proportions and relative risk using score-
type  tests.  For  sample  size  determination  in  logistic  regression,  see  Hsieh  et  al.  (1998),  Lyles 
et al. (2006), Schoenfeld and Borenstein (2005), Vaeth and Skovlund (2004), and Whittemore
(1981). Lachin (1977) considered I × J tables. Drost et al. (1989), Haberman (1974a, pp.
109-112), Meng and Chapman (1966), Mitra (1958), and Patnaik (1949) derived theory for
asymptotic nonnull behavior of chi-squared statistics; see also Section 16.3.5. O’Brien's (1986)
simulation results suggested that the noncentral chi-squared approximation for $G^2$ holds well
for a wide range of powers. Read and Cressie (1988, pp. 147-148) listed other articles that
studied the nonnull behavior of $X^2$ and $G^2$.


EXERCISES 

Applications 

6.1 For the horseshoe crab mating data, the maximized log-likelihood value is -112.88
for the model with only an intercept, -97.87 for the model with weight as a predictor,
-97.23 for the model with width as a predictor, and -96.45 for the model using both
as predictors. Conduct (a) a test of $H_0$: $\beta_1 = \beta_2 = 0$ for the joint effects, and (b)
separate tests for the partial effects. Why does neither test in part (b) show evidence
of an effect when the test in part (a) shows strong evidence?

6.2 For the horseshoe crab mating data, Table 6.14 shows ML estimates for two models
using  weight  and  color  (with  dark  color  as  the  baseline)  as  predictors  of  satellite 
presence.  Compare  the  models  using  a  likelihood-ratio  test  and  using  AIC.  Select  a 
model,  and  interpret  its  estimates. 


Table 6.14 Effects for Two Models with Predictors of Crab Satellites,
for Exercise 6.2

                          Model 1               Model 2
Term                  Estimate     SE       Estimate     SE
Intercept               -4.53     1.00        -1.19     2.30
Weight                   1.69     0.39         0.19     1.03
Color 1                  1.27     0.85        -0.43     5.40
Color 2                  1.41     0.54        -1.27     2.58
Color 3                  1.08     0.59        -6.73     3.44
Weight × Color 1                               0.85     2.16
Weight × Color 2                               1.21     1.14
Weight × Color 3                               3.56     1.56
Log-likelihood         -94.27                -90.83
AIC                    198.54                197.66

6.3 The book’s website (www.stat.ufl.edu/~aa/cda/cda.html) has a 2 × 3 × 2 × 2
table relating responses on frequency of attending religious services, political
views,  opinion  on  making  birth  control  available  to  teenagers,  and  opinion  about 
whether  premarital  sex  before  marriage  is  wrong.  Treating  opinion  about  premarital 
sex  as  the  response  variable,  use  backward  elimination  to  select  a  model.  Interpret. 

6.4  For  Table  10.1,  treating  marijuana  use  as  the  response  variable,  build  a  model 
with alcohol use, cigarette use, gender, and race as potential explanatory variables.
Summarize your strategy for selecting a model, and interpret your final choice of
model.

6.5  For  Table  6.4,  fit  the  stage  3  model  denoted  there  by  (E  *  P  +  G).  Use  parameter 
estimates  to  interpret  the  G  effect  and  the  dependence  of  the  E  effect  on  P. 

6.6  According  to  the  Independent  newspaper  (London,  Mar.  8,  1994),  the  Metropolitan 
Police  in  London  reported  30,475  people  as  missing  in  the  year  ending  March  1993. 
For those of age 13 or less, 33 of 3271 missing males and 38 of 2486 missing females
were  still  missing  a  year  later.  For  ages  14  to  18,  the  values  were  63  of  7256  males 
and  108  of  8877  females;  for  ages  19  and  above,  the  values  were  157  of  5065  males 
and  159  of  3520  females.  Analyze  by  building  a  model,  and  interpret.  (Thanks  to 
Pat  Altham  for  showing  me  these  data.) 

6.7  Fowlkes  et  al.  (1988)  reported  results  of  a  survey  of  employees  of  a  large  national 
corporation  to  determine  how  satisfaction  depends  on  race,  gender,  age,  and  regional 
location.  The  data  are  at  the  book’s  website.  Build  a  logistic  model  for  these  data 
and  carefully  interpret  the  parameter  estimates. 

6.8 Table 6.15 shows the results of a study about Y = whether a patient having surgery
with  general  anesthesia  experienced  a  sore  throat  on  waking  (0  =  no,  1  =  yes)  as 
a  function  of  the  D  =  duration  of  the  surgery  (in  minutes)  and  the  T  =  type  of 
device  used  to  secure  the  airway  (0  =  laryngeal  mask  airway,  1  =  tracheal  tube). 
Use  a  model-building  strategy  to  select  a  logistic  model  for  these  predictors.  For 
your  model,  interpret  parameter  estimates,  and  conduct  inference  about  the  effects. 


Table 6.15 Data for Exercise 6.8 on Surgery and Sore Throats

Patient    D    T    Y      Patient    D    T    Y      Patient    D    T    Y
   1      45    0    0        13      50    1    0        25      20    1    0
   2      15    0    0        14      75    1    1        26      45    0    1
   3      40    0    1        15      30    0    0        27      15    1    0
   4      83    1    1        16      25    0    1        28      25    0    1
   5      90    1    1        17      20    1    0        29      15    1    0
   6      25    1    1        18      60    1    1        30      30    0    1
   7      35    0    1        19      70    1    1        31      40    0    1
   8      65    0    1        20      30    0    1        32      15    1    0
   9      95    0    1        21      60    0    1        33     135    1    1
  10      35    0    1        22      61    0    0        34      20    1    0
  11      75    0    1        23      65    0    1        35      40    1    0
  12      45    1    1        24      15    1    0

Source: Data from “Binary Data” by D. Collett, in Encyclopedia of Biostatistics, 2nd ed. Hoboken, NJ: Wiley,
2005, pp. 439-446.


6.9  Refer  to  the  previous  exercise.  Use  a  measure  of  predictive  power  to  compare  the 
fits of various models to these data.

6.10 Refer to the previous two exercises. For your preferred model:

a. Summarize predictive power using classification tables with $\pi_0 = 0.50$ and $\pi_0 = \bar{y}$.
In each case, report and interpret the sensitivity and specificity.

b. Summarize predictive power using an ROC curve. Report and interpret the
concordance index.

6.11  Discern  the  reasons  that  Simpson’s  paradox  occurs  for  the  graduate  admissions  data 
of  Table  6.6. 

6.12 Refer to Exercise 2.15 on graduate school admissions and gender. Fit the model
of no G effect, given the department. Use $X^2$ to test the fit. Obtain standardized
residuals, explain how they relate to $X^2$, and interpret the lack of fit.

6.13  Conduct  a  residual  analysis  for  the  independence  model  with  Table  5.5  on  treating 
leprosy.  What  type  of  lack  of  fit  is  indicated? 

6.14  For  the  horseshoe  crab  data,  use  methods  such  as  Section  6.3  shows  to  evaluate 
predictive  power  for  logistic  models  that  include  weight  and  color  as  explanatory 
variables. 

6.15 Table 6.16 refers to the effectiveness of immediately injected or 1½-hour-delayed
penicillin in protecting rabbits against lethal injection with β-hemolytic
streptococci.

a. Let X = delay, Y = whether cured, and Z = penicillin level. Fit the logistic model
(6.4). Argue that the pattern of 0 cell counts suggests that (with no intercept)
$\hat{\beta}_1^Z = -\infty$ and $\hat{\beta}_5^Z = \infty$. What does your software report?

b.  Using  the  logistic  model,  conduct  the  likelihood-ratio  test  of  XY  conditional 
independence.  Interpret. 


Table  6.16  Data  for  Exercise  6.15  on  Penicillin 
Treatment  for  Streptococcus 


Penicillin                  Response
Level        Delay       Cured    Died
1/8          None          0        6
             1½ h          0        5
1/4          None          3        3
             1½ h          0        6
1/2          None          6        0
             1½ h          2        4
1            None          5        1
             1½ h          6        0
4            None          2        0
             1½ h          5        0

Source:  Reprinted  with  permission  from  Mantel  (1963). 

c. Test XY conditional independence using the Cochran-Mantel-Haenszel test.
Interpret.

d.  Estimate  the  XY  conditional  odds  ratio  using  (i)  ML  with  the  logistic  model,  and 
(ii)  the  Mantel-Haenszel  estimate.  Interpret. 

6.16  Refer  to  Table  2.6.  Use  the  CMH  statistic  to  test  independence  of  death  penalty 
verdict  and  victim’s  race,  controlling  for  defendant’s  race.  Conduct  another  test  of 
this  hypothesis,  and  compare  results. 

6.17  Treatments  A  and  B  were  compared  on  a  binary  response  for  40  pairs  of  subjects 
matched on relevant covariates. For each pair, treatments were assigned to the
subjects randomly. Twenty pairs of subjects made the same response for each treatment.
Six  pairs  had  a  success  for  the  subject  receiving  A  and  a  failure  for  the  subject 
receiving  B,  whereas  the  other  14  pairs  had  a  success  for  B  and  a  failure  for  A. 
Use  the  Cochran-Mantel-Haenszel  procedure  to  test  independence  of  response  and 
treatment.  (In  Section  11.1  we  present  an  equivalent  test,  McNemar's  test.) 

6.18 For the data summarized in Figure 1 of the 2011 Lancet article by Rothwell et al.
(377: 31-41) from eight studies on the effect of daily aspirin on cancer deaths,
conduct  a  meta-analysis  that  combines  a  significance  test  with  a  confidence  interval 
to  summarize  the  size  of  effect.  Interpret. 

6.19  For  the  data  summarized  in  Figure  1  of  the  2010  American  Statistician  article  by 
Kulinskaya  et  al.  (64: 350-356),  conduct  a  meta-analysis  that  combines  a  significance 
test  with  a  confidence  interval  to  summarize  the  size  of  effect.  Interpret. 

6.20 A data set at the text website from a 2005 article by D. Potter (Statist. Med. 24:
693-708) describes results from a study in which subjects received a drug and the
outcome measures whether the subject became incontinent (y = 1, yes; y = 0,
no).  The  three  explanatory  variables  are  lower  urinary  tract  variables  that  represent 
drug-induced  physiological  changes. 

a.  Find  the  prediction  equations  when  each  predictor  is  used  separately  in  logistic 
regressions. 

b. Try to fit a main-effects logistic model containing all three predictors. What does
your software report for the effects and their standard errors? (The ML estimates
are actually $-\infty$ for $x_1$ and $x_2$ and $\infty$ for $x_3$.) Can you see a pattern in the data
that is responsible for this behavior?

6.21 Refer to the example of complete separation in Section 6.5.1. For the 8 observations,
randomly generate values for a second predictor from the N(0, 1) distribution. Taking
both  explanatory  variables  in  your  model,  is  there  still  complete  separation?  Is  there 
quasi-complete  separation?  What  does  your  software  report  for  the  model  parameter 
estimates  and  SE  values? 

6.22 Refer to the multicenter clinical trial of Table 6.11.

a.  Fit  the  main  effects  model  considered  in  the  text  with  your  favorite  software 
(omitting  the  intercept),  and  summarize  results. 

b. For Center 1, add $\epsilon$ successes for the active treatment, and report the impact (if
any) on the estimated Center 1 effect and on $\hat{\beta}$. Do this for $\epsilon = 10^{-6}$,
$\epsilon = 10^{-3}$, and $\epsilon = 0.50$. Do such centers give any information about the
treatment log odds ratio effect, as described by $\hat{\beta}$ and its SE?

6.23 Apply the logistic regression model to the 2 × 2 table consisting of the data for
Center 5 in Table 6.9, where x = 1 for drug and x = 0 for control.

a. Report the ML estimate $\hat{\beta}$.

b. What does your software report when you try to fit this model? Explain why.

c. Can you construct a 95% confidence interval for $\beta$? Show how.

6.24 For the example in Section 6.6.1, suppose $\pi_1 = 0.70$ and $\pi_2 = 0.60$. What sample
size is needed for the test to have approximate power 0.80, when $\alpha = 0.05$, for (a)
$H_a: \pi_1 \ne \pi_2$ and (b) $H_a: \pi_1 > \pi_2$?

6.25 For the example in Section 6.6.1 with equal treatment sample sizes, suppose $\pi_1 =
0.63$ and $\pi_2 = 0.57$. Explain why the joint probabilities in the 2 × 2 table are 0.315
and 0.185 for treatment A and 0.285 and 0.215 for treatment B. For the model of
independence, explain why the fitted joint probabilities are 0.30 for success and 0.20
for failure, in each row. Show that $X^2$ has noncentrality parameter $0.00375n$ and
df = 1. For n = 200 and $\alpha = 0.05$, find the power.

6.26 An experiment is designed to compare two treatments on a three-category response.
The researcher expects the conditional distributions to be approximately (0.2, 0.2,
0.6) and (0.3, 0.3, 0.4).

a. With 100 observations for each treatment and $\alpha = 0.05$, find the approximate
power to compare the distributions using (i) $X^2$ and (ii) $G^2$. Compare
results.

b. What sample size is needed for each treatment for the tests in (a) to have
approximate power 0.90?

6.27 The horseshoe crab width values in Table 4.3 have $\bar{x} = 26.3$ and $s_x = 2.1$. If the true
relationship were similar to the fitted equation in Section 5.1.3, about how large a
sample yields P(type II error) = 0.10, with $\alpha = 0.05$, for testing $H_0: \beta = 0$ against
$H_a: \beta > 0$?

6.28 This book's website (www.stat.ufl.edu/~aa/cda/cda.html) contains a five-way
table relating occupational aspirations (high, low) to gender, residence, IQ, and
socioeconomic status. Analyze these data.

6.29  In  recent  years  there  has  been  controversy  about  the  effects  of  rosiglitazone  (an 
antidiabetic drug) on myocardial infarction (MI) and cardiovascular mortality.
Review the 2010 meta-analysis by S. Nissen and K. Wolski in Archives of Internal
Medicine  (14:  1191-1201).  Conduct  your  own  analysis  of  the  effects  of  rosiglitazone 
on  MI. 

Theory and Methods

6.30 For a sequence of s nested models $M_1, \ldots, M_s$, model $M_s$ is the most complex. Let
$v$ denote the difference in residual df between $M_1$ and $M_s$.

a. Explain why for j < k, $G^2(M_j \mid M_k) \le G^2(M_j \mid M_s)$.

b. Assume model $M_j$, so that $M_k$ also holds when k > j. For all k > j, as $n \to \infty$,
$P[G^2(M_j \mid M_k) > \chi^2_v(\alpha)] \le \alpha$. Explain why.

c. Gabriel (1966) suggested a simultaneous testing procedure in which, for each
pair of models, the critical value for differences between $G^2$ values is $\chi^2_v(\alpha)$. The
final model accepted must be more complex than any model rejected in a pairwise
comparison. Since part (b) is true for all j < k, argue that Gabriel's procedure
has type I error probability no greater than $\alpha$.

6.31 Prove that the Pearson residuals for the linear logit model applied to an $I \times 2$
contingency table satisfy $X^2 = \sum_{i=1}^{I} e_i^2$. [Hint: Start with the $X^2$ sum over the $2I$ cells
and combine the two terms from the same row.] Note that this holds for a binomial
GLM with a linear trend for any link function.

6.32 For ungrouped binary data, explain why when $\hat{\pi}_i$ is near 1, residuals are necessarily
either small and positive or large and negative. What happens when $\hat{\pi}_i$ is near 0?

6.33 For a 2 × 2 × K table from a multicenter clinical trial, one center has entries (0, n)
in row 1 and (0, 2n) in row 2 (i.e., no successes for either treatment).

a. Explain why there is no information in this table about whether there is an
association, regardless of the value of n. [Hint: Show that $\hat{\pi}_1 - \hat{\pi}_2 = 0$ has estimated
null SE = 0, and the P-value is 1.0 for Fisher's exact test or for an unconditional
exact test.]

b. Explain why there is information in the table about the size of association,
in terms of the difference of proportions, and the precision of information
increases as n increases. Illustrate by finding the 95% score confidence
intervals for $\pi_1$, $\pi_2$, and $\pi_1 - \pi_2$, when n = 10 and when n = 100. (See
www.stat.ufl.edu/~aa/cda/R for R functions. Note that Wald intervals
are useless for such data.)

6.34 Refer to logit model (6.4) for a 2 × 2 × K contingency table $\{n_{ijk}\}$. Using a basic
result for testing in exponential families, explain why uniformly most powerful
unbiased tests of conditional XY independence are based on $\sum_k n_{11k}$ (Birch 1964b;
Lehmann and Romano 2005, Sec. 4.8).

6.35 Suppose that $\{\pi_{ijk}\}$ in a 2 × 2 × 2 table are, by row, (0.15, 0.10 / 0.10, 0.15) when
Z = 1 and (0.10, 0.15 / 0.15, 0.10) when Z = 2. For testing conditional XY
independence with logistic models having Y as a response, explain why the likelihood-ratio
test comparing models X + Z and Z is not consistent but the likelihood-ratio test of
fit of the XY conditional independence model is.

6.36 For 2 × 2 tables with all marginal totals positive, explain what patterns of 0 cell
counts correspond to (a) complete separation and (b) quasi-complete separation.

6.37 For k explanatory variables, suppose logistic regression has finite parameter estimates
when used with each predictor alone. Explain why infinite estimates could occur
when the predictors are all used in a main-effects model. Sketch a graph with k = 2
to illustrate this.

6.38  In  Table  6.11,  suppose  the  outcome  of  0  successes  for  the  active  drug  in  Centers  1 
and  3  was  instead  a  positive  count,  but  there  were  still  no  successes  for  placebo  in 
those  centers.  Explain  why  all  estimates  would  be  finite  for  the  main-effects  model 
fitted  in  Section  6.5.2,  but  infinite  estimates  would  occur  for  the  more  general  model 
permitting  center-by-treatment  interaction. 

6.39  Explain  why  complete  or  quasi-complete  separation  would  not  cause  ML  estimates 
to  be  infinite  if  you  were  using  the  identity  link  function  but  might  cause  other 
problems  with  the  iterative  fitting  process. 

6.40 For a GLM, let $\hat{\mu}_{(-)} = (\hat{\mu}_{1(-1)}, \ldots, \hat{\mu}_{n(-n)})$, where $\hat{\mu}_{i(-i)}$ denotes the estimate of $E(Y_i)$
for observation i after fitting the model without that observation. The leave-one-out
cross-validation adjustment to the predictive measure $R(y, \hat{\mu})$ is $\text{corr}(y, \hat{\mu}_{(-)})$.
For binary data, consider the model $\text{logit}(\pi_i) = \alpha$ for all i. Show that $\hat{\pi}_i = \bar{y}$,
$\hat{\pi}_{i(-i)} = [n/(n-1)][\bar{y} - (1/n)y_i]$, and hence $\text{corr}(y, \hat{\pi}_{(-)}) = -1$. This suggests that
leave-one-out cross-validation can be misleading for estimating the correlation with
model $\text{logit}(\pi_i) = \alpha + \beta x$ when the true effect is very weak (Zheng and Agresti
2000).

6.41  Using  graphs  or  tables,  explain  what  is  meant  by  no  interaction  in  modeling  response 
variable  Y  and  explanatory  variables  X  and  Z  when: 

a.  All  variables  are  continuous  (multiple  regression). 

b.  Y  and  X  are  continuous,  Z  is  categorical  (analysis  of  covariance). 

c.  Y  is  continuous,  X  and  Z  are  categorical  (two-way  ANOVA). 

d.  Y  is  binary,  X  and  Z  are  categorical  (logistic  regression). 


CHAPTER  7 


Alternative  Modeling  of 
Binary  Response  Data 


In  Chapters  5  and  6  we  have  focused  on  logistic  regression  modeling  of  binary  response 
data.  This  chapter  presents  some  alternative  ways  of  modeling  binary  data. 

Although the logit is the most popular link function for binary responses, other links
are sometimes more appropriate. In Section 7.1 we present the probit model, which results
from normal latent variable models. We also present models using a double log link
function, which imply nonsymmetric response curves. In Section 7.2 we introduce Bayesian
approaches for modeling binary responses. For small samples or models with many
parameters, ordinary ML inference may perform poorly. In Section 7.3 we discuss conditional
logistic regression. This method uses conditioning arguments to eliminate nuisance
parameters and can provide inference based on exact distributions rather than large-sample
approximations.

In  Section  7.4  we  present  methods  for  discovering  structure  by  smoothing  the  data. 
A  simple  version  of  kernel  smoothing  estimates  a  probability  at  any  point  simply  by 
averaging  binary  data  at  nearby  points.  The  penalized  likelihood  method  maximizes  an 
adjusted  (“penalized”)  version  of  the  likelihood  function,  producing  parameter  estimates 
that  tend  to  be  more  smooth,  with  some  of  the  estimates  possibly  even  shrinking  to  0  under 
one  type  of  penalty.  The  generalized  additive  model  extends  generalized  linear  models  by 
allowing  an  unspecified  function  of  an  explanatory  variable  as  a  predictor.  The  final  section 
discusses  some  issues  that  arise  in  using  the  models  for  binary  data  sets  having  very  large 
numbers  of  potential  explanatory  variables. 

7.1  PROBIT  AND  COMPLEMENTARY  LOG-LOG  MODELS 

In this section we present two alternatives to logistic models for binary responses. These
models for $\pi(x) = P(Y = 1)$ have form

$$g[\pi(x)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p, \qquad (7.1)$$

for a link function g other than the logit.

7.1.1 Probit Models: Three Latent Variable Motivations

In Section 4.2.6 we saw that in toxicology studies with dosage predictor x, a latent variable
model naturally leads to a binary regression model. Specifically, a tolerance distribution
with cdf F for the dosage that induces a success response implies a model in which the
link g is the inverse of a standardized cdf $\Phi$ for the family to which F belongs. That is, the
model has form

$$\Phi^{-1}[\pi(x)] = \alpha + \beta x, \qquad (7.2)$$

with link function $\Phi^{-1}$. Toxicological experiments often measure dosage as the log
concentration and take the tolerance distribution to be approximately $N(\mu, \sigma^2)$ for unknown
$\mu$ and $\sigma$ (Bliss 1935). If F is a normal cdf, then $\pi(x)$ satisfies this model with $\Phi$ as the
standard normal cdf. It is called the probit model. The probit link function is $\Phi^{-1}(\cdot)$.

A related normal latent variable model, referred to as a threshold model, also implies
the probit model. This model assumes there is an unobserved continuous response $y^*$ such
that the observed response y = 0 if $y^* \le \tau$ and y = 1 if $y^* > \tau$. Suppose that $y^* = \mu + \epsilon$,
where $\mu = \alpha + \beta x$ and where $\{\epsilon_i\}$ are independent from a $N(0, \sigma^2)$ distribution. Then,

$$P(Y = 1) = P(Y^* > \tau) = P(\alpha + \beta x + \epsilon > \tau)$$
$$= P(-\epsilon < \alpha + \beta x - \tau) = \Phi[(\alpha + \beta x - \tau)/\sigma].$$

(Note that $-\epsilon$ has the same distribution as $\epsilon$.) There is no information in the data about $\sigma$
or the threshold $\tau$. An equivalent model results if we multiply $(\alpha, \beta, \sigma, \tau)$ by any positive
constant. For identifiability, we set $\sigma = 1$ and $\tau = 0$. Thus, the probit model results. The
logistic model follows when $\epsilon$ has instead a standard logistic distribution.
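
The identifiability remark can be checked numerically. The following is a minimal sketch, assuming hypothetical parameter values (they do not come from the text): it evaluates $\Phi[(\alpha + \beta x - \tau)/\sigma]$ before and after multiplying $(\alpha, \beta, \sigma, \tau)$ by a positive constant, and again after reducing to the identified form with $\sigma = 1$ and $\tau = 0$.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_success(x, alpha, beta, sigma, tau):
    """P(Y = 1) = Phi[(alpha + beta*x - tau) / sigma] under the threshold model."""
    return norm_cdf((alpha + beta * x - tau) / sigma)

# Hypothetical parameter values, chosen only for illustration.
alpha, beta, sigma, tau = 0.8, 1.5, 2.0, 0.3
c = 7.0                                   # any positive constant
xs = [-2.0, -0.5, 0.0, 1.0, 3.0]

original = [p_success(x, alpha, beta, sigma, tau) for x in xs]
rescaled = [p_success(x, c * alpha, c * beta, c * sigma, c * tau) for x in xs]

# Identified version: sigma = 1 and tau = 0, with the intercept absorbing tau.
alpha_id, beta_id = (alpha - tau) / sigma, beta / sigma
identified = [p_success(x, alpha_id, beta_id, 1.0, 0.0) for x in xs]

max_gap = max(
    max(abs(a - b), abs(a - d)) for a, b, d in zip(original, rescaled, identified)
)
print(f"largest difference in P(Y = 1): {max_gap:.2e}")
```

All three parameterizations give the same success probabilities up to rounding error, which is why the data cannot distinguish among them.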

A third normal latent variable derivation of the probit model is based on utility functions.
Consider the choice between two options, such as two product brands. Let $U_0$ denote the
utility of outcome y = 0 and $U_1$ the utility of y = 1. For y = 0 and 1, suppose that
$U_y = \alpha_y + \beta_y x + \epsilon_y$. A particular subject selects y = 1 if their $U_1 > U_0$. Now suppose
that $\epsilon_0$ and $\epsilon_1$ are independent N(0, 1) random variables. Then,

$$P(Y = 1) = P(\alpha_1 + \beta_1 x + \epsilon_1 > \alpha_0 + \beta_0 x + \epsilon_0)$$
$$= P\{(\epsilon_0 - \epsilon_1)/\sqrt{2} < [(\alpha_1 - \alpha_0) + (\beta_1 - \beta_0)x]/\sqrt{2}\} = \Phi(\alpha^* + \beta^* x),$$

where $\alpha^* = (\alpha_1 - \alpha_0)/\sqrt{2}$ and $\beta^* = (\beta_1 - \beta_0)/\sqrt{2}$. This is the probit model.

All  three  of  these  latent  variable  approaches  extend  directly  to  multiple  explanatory 
variables.  The  probit  extends  to  an  inverse  t  link,  for  which  corresponding  latent  variable 
models  can  better  accommodate  outliers. 

7.1.2  Probit  Models:  Interpreting  Effects 

For the probit model with a single quantitative predictor, the response curve for $\pi(x)$ [or
for $1 - \pi(x)$, when $\beta < 0$] has the appearance of the normal cdf with mean $\mu = -\alpha/\beta$
and standard deviation $\sigma = 1/|\beta|$. Since 68% of the normal density falls within a standard
deviation of the mean, $1/|\beta|$ is the distance between x values where $\pi(x) = 0.16$ or 0.84 and
where $\pi(x) = 0.50$. The instantaneous rate of change in $\pi(x)$ is $\partial\pi(x)/\partial x = \beta\phi(\alpha + \beta x)$,
where $\phi(\cdot)$ is the standard normal density function. The rate is highest when $\alpha + \beta x = 0$
(i.e., at $x = -\alpha/\beta$), where it equals $\beta/(2\pi)^{1/2} = 0.40\beta$ (for $\pi = 3.14\ldots$). At that point,
$\pi(x) = 1/2$.

By comparison, in logistic regression with parameter $\beta$, the curve for $\pi(x)$ is a logistic
cdf with standard deviation $\pi/(|\beta|\sqrt{3})$. Its rate of change in $\pi(x)$ at $x = -\alpha/\beta$ is $0.25\beta$. The
rates of change where $\pi(x) = 1/2$ are the same for the cdf's corresponding to the probit and
logistic curves when the logistic $\beta$ is 0.40/0.25 = 1.60 times the probit $\beta$. The standard
deviations are the same when the logistic $\beta$ is $\pi/\sqrt{3} = 1.81$ times the probit $\beta$. When both
models fit well, parameter estimates in logistic regression are about 1.6 to 1.8 times those
in probit models.
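
The two scaling factors quoted above follow from simple arithmetic. A short sketch (variable names are illustrative only) that reproduces them:

```python
import math

# Maximum slope of the response curve per unit of beta:
# probit: phi(0) = 1 / sqrt(2*pi); logistic: pi(1 - pi) at pi = 1/2.
probit_slope_per_beta = 1.0 / math.sqrt(2.0 * math.pi)
logit_slope_per_beta = 0.25

# Logistic beta needed to match the probit curve's slope at pi(x) = 1/2.
ratio_matching_slopes = probit_slope_per_beta / logit_slope_per_beta

# Logistic beta needed to match standard deviations:
# the standard logistic cdf has sd pi/sqrt(3), the standard normal has sd 1.
ratio_matching_sds = math.pi / math.sqrt(3.0)

print(f"match slopes at 0.50: logistic beta = {ratio_matching_slopes:.3f} x probit beta")
print(f"match standard devs : logistic beta = {ratio_matching_sds:.3f} x probit beta")
```

The two criteria give slightly different multipliers, which is why the text quotes a range of about 1.6 to 1.8 rather than a single conversion factor.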

Parameters in probit models can be interpreted in terms of effects on $E(Y^*)$ for the threshold
latent variable model presented above. Since $Y^* = \alpha + \beta x + \epsilon$ where $\epsilon \sim N(0, 1)$ has
cdf $\Phi$, a 1-unit increase in x corresponds to a $\beta$ increase in $E(Y^*)$. When $\epsilon$ is not in
standardized form with $\sigma = 1$, a 1-unit increase in x corresponds to a $\beta$ standard deviation
increase in $E(Y^*)$. Alternatively, we can summarize effects on the probability scale, such
as  by  comparing  estimated  probabilities  at  extreme  values  or  quartiles  of  a  predictor,  with 
other  predictors  set  at  their  means.  (This  was  discussed  for  logistic  models  in  Section 
5.1.1.)  Although  probit  model  parameter  estimates  are  on  a  different  scale  than  logistic 
model  parameter  estimates,  the  probability  summaries  of  effects  are  similar. 


7.1.3  Probit  Model  Fitting 

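Let $y_i$ be the number of successes out of $n_i$ trials at setting $x_i$ of possibly multiple
explanatory variables, $i = 1, \ldots, N$. Let $x_{ij}$ denote the value of predictor j for subject i. For the
probit model $\Phi^{-1}[\pi(x_i)] = \sum_j \beta_j x_{ij}$ with $x_{i0} = 1$ and $\beta_0 = \alpha$, the log-likelihood function
is

$$L(\beta) = \log \prod_{i=1}^{N} \left\{ \left[ \Phi\left( \sum_j \beta_j x_{ij} \right) \right]^{y_i} \left[ 1 - \Phi\left( \sum_j \beta_j x_{ij} \right) \right]^{n_i - y_i} \right\}.$$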


Differentiation with respect to $\beta_j$ leads to a special case of the likelihood equations (4.27)
for binomial regression models,

$$\sum_{i=1}^{N} \frac{(y_i - n_i \hat{\pi}_i)\, x_{ij}}{\hat{\pi}_i (1 - \hat{\pi}_i)}\, \phi\left( \sum_k \hat{\beta}_k x_{ik} \right) = 0, \quad j = 0, 1, \ldots, p, \qquad \text{where } \hat{\pi}_i = \Phi\left( \sum_k \hat{\beta}_k x_{ik} \right),$$

with $\phi(\cdot)$ the standard normal pdf. When the link function is not the canonical one (which
is the logit for binary data), there is no reduction of the data in sufficient statistics. Fisher
(1935b), in an appendix to Bliss (1935) for the single predictor case, showed how to solve
these equations using the algorithm now referred to as Fisher scoring. He also pointed out
that cases with $y_i = 0$ or $y_i = n_i$ were not problematic for ML fitting, unlike weighted least
squares using sample probits (or logits).

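The estimated asymptotic covariance matrix of $\hat{\beta}$ has the GLM form (4.31)

$$\widehat{\text{cov}}(\hat{\beta}) = (X^T \hat{W} X)^{-1}.$$

For probit models, $\hat{W}$ is the diagonal matrix with elements

$$\frac{n_i\, \phi^2\left( \sum_j \hat{\beta}_j x_{ij} \right)}{\Phi\left( \sum_j \hat{\beta}_j x_{ij} \right) \left[ 1 - \Phi\left( \sum_j \hat{\beta}_j x_{ij} \right) \right]}.$$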

The Newton-Raphson algorithm yields the same ML estimates but slightly different
standard errors. For the information matrix inverted to obtain the asymptotic covariance matrix,
Newton-Raphson uses observed information, whereas Fisher scoring uses expected
information. These differ for link functions other than the canonical link.
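
To make the fitting procedure concrete, here is a minimal sketch of Fisher scoring for a probit model with one predictor, written in plain Python. It is an illustration under stated assumptions rather than the algorithm as implemented in any particular package: the function names are hypothetical, the starting values are simply zero, and the data are the beetle mortality counts of Table 7.1 in Section 7.1.4. Each iteration adds (expected information)$^{-1}$ × score to the current estimate, with the score terms and the diagonal weights taken from the formulas above.

```python
import math

# Beetle mortality data as tabulated in Table 7.1 (Bliss 1935).
log_dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n_beetles = [59, 60, 62, 56, 63, 59, 62, 60]
n_killed = [6, 13, 18, 28, 52, 53, 61, 60]

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def score_and_information(a, b, x, n, y):
    """Score vector and expected information for the probit model a + b*x."""
    u = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for xi, ni, yi in zip(x, n, y):
        eta = a + b * xi
        p, d = norm_cdf(eta), norm_pdf(eta)
        resid = (yi - ni * p) * d / (p * (1.0 - p))   # likelihood-equation term
        w = ni * d * d / (p * (1.0 - p))              # diagonal element of W
        design = (1.0, xi)
        for r in range(2):
            u[r] += resid * design[r]
            for c in range(2):
                info[r][c] += w * design[r] * design[c]
    return u, info

def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def fit_probit(x, n, y, tol=1e-10, max_iter=100):
    a = b = 0.0                                       # assumed starting values
    for _ in range(max_iter):
        u, info = score_and_information(a, b, x, n, y)
        cov = invert_2x2(info)
        da = cov[0][0] * u[0] + cov[0][1] * u[1]      # Fisher scoring step
        db = cov[1][0] * u[0] + cov[1][1] * u[1]
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    _, info = score_and_information(a, b, x, n, y)
    return a, b, invert_2x2(info)                     # estimates and (X'WX)^{-1}

alpha_hat, beta_hat, cov_hat = fit_probit(log_dose, n_beetles, n_killed)
se_alpha, se_beta = math.sqrt(cov_hat[0][0]), math.sqrt(cov_hat[1][1])
print(f"alpha_hat = {alpha_hat:.2f} (SE {se_alpha:.2f})")
print(f"beta_hat  = {beta_hat:.2f} (SE {se_beta:.2f})")
```

Because the sketch inverts the expected information, the standard errors it reports are the Fisher scoring ones; a Newton-Raphson implementation would reach the same estimates but report slightly different standard errors.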


7.1.4  Example:  Modeling  Flour  Beetle  Mortality 

Table 7.1, from the Bliss (1935) article on probit modeling, reports the number of adult flour
beetles killed after 5 hours of exposure to gaseous carbon disulfide at various concentrations.
Figure 7.1 plots (as dots) the proportion killed against the log concentration. The proportion
jumps up at about x = 1.8, and it is close to 1 above there.

The ML fit of the probit model is

$$\Phi^{-1}[\hat{\pi}(x)] = -34.94 + 19.73x.$$

For this fit, $\hat{\pi}(x) = 0.50$ at $x = -\hat{\alpha}/\hat{\beta} = 34.94/19.73 = 1.77$. The fit corresponds to a
normal tolerance distribution with $\mu = 1.77$ and $\sigma = 1/19.73 = 0.05$. The curve for $\hat{\pi}(x)$ is
that of a $N(1.77, 0.05^2)$ cdf. As x increases from 1.6907 to 1.8839, the estimated probability
of death increases from 0.057 to 0.987. For a 0.10-unit increase in x, such as from 1.70
to 1.80, we estimate that the conditional distribution of the latent variable $y^*$ shifts up by
$0.10(19.73) \approx 2$ standard deviations.
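
The summaries in this paragraph can be reproduced directly from the reported estimates. A brief sketch, taking the rounded values −34.94 and 19.73 as given (variable names are illustrative):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

alpha_hat, beta_hat = -34.94, 19.73       # probit estimates as reported in the text

def pi_hat(x):
    """Estimated probability of death at log dose x."""
    return norm_cdf(alpha_hat + beta_hat * x)

median_dose = -alpha_hat / beta_hat        # log dose at which pi_hat = 0.50
sigma_tolerance = 1.0 / beta_hat           # sd of the implied tolerance distribution
p_low, p_high = pi_hat(1.6907), pi_hat(1.8839)
shift_in_sd = 0.10 * beta_hat              # latent shift for a 0.10-unit dose increase

print(f"median effective log dose: {median_dose:.2f}")
print(f"tolerance sd: {sigma_tolerance:.3f}")
print(f"pi_hat at 1.6907 and 1.8839: {p_low:.3f}, {p_high:.3f}")
print(f"latent shift (in standard deviations): {shift_in_sd:.2f}")
```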

At dosage $x_i$ with $n_i$ beetles, $n_i\hat{\pi}(x_i)$ is the fitted count for death, $i = 1, \ldots, 8$. Table 7.1
reports the fitted values and Figure 7.1 shows the fit. The table also shows fitted values
for the linear logit model. These models fit similarly and rather poorly. The deviance $G^2$
goodness-of-fit statistic equals 11.23 for the logit model and 10.12 for the probit model,
with df = 6. Bliss found an improved fit by combining a probit model for the lowest three


Table  7.1  Beetles  Killed  After  Exposure  to  Carbon  Disulfide 


              Number       Number              Fitted Values
Log Dose    of Beetles     Killed    Comp. Log-Log   Probit   Logit
1.6907          59            6           5.6          3.4      3.5
1.7242          60           13          11.3         10.7      9.8
1.7552          62           18          21.0         23.5     22.5
1.7842          56           28          30.4         33.8     33.9
1.8113          63           52          47.8         49.6     50.1
1.8369          59           53          54.1         53.3     53.3
1.8610          62           61          61.1         59.7     59.2
1.8839          60           60          59.9         59.2     58.7

Source: Data reprinted with permission from Bliss (1935).

Figure 7.1 Proportion of beetles killed versus log dosage, with fits of probit and complementary log-log models.


concentrations  with  a  separate  one  for  the  third  through  eighth  concentration.  We  next 
consider  an  alternative  model  that  gives  a  good  fit  to  all  eight  concentrations  at  once. 


7.1.5  Complementary  Log-Log  Link  Models 

The logit and probit links are symmetric about 0.50, in the sense that

$$\text{link}[\pi(x)] = -\text{link}[1 - \pi(x)].$$

To illustrate,

$$\text{logit}[\pi(x)] = \log[\pi(x)/(1 - \pi(x))] = -\log[(1 - \pi(x))/\pi(x)] = -\text{logit}[1 - \pi(x)].$$

This means that the response curve for $\pi(x)$ has a symmetric appearance about the point
where $\pi(x) = 0.50$, with $\pi(x)$ approaching 0 at the same rate that it approaches 1. Logistic
models and probit models are inappropriate when this is badly violated.

The response curve

$$\pi(x) = 1 - \exp[-\exp(\alpha + \beta x)] \qquad (7.3)$$

has the shape shown in Figure 7.2. It is asymmetric, $\pi(x)$ approaching 0 fairly slowly but
approaching 1 quite sharply. For this model,

$$\log[-\log(1 - \pi(x))] = \alpha + \beta x.$$

The link function for this GLM is called the complementary log-log link, since the log-log
link applies to the complement of $\pi(x)$ (Yates 1955).
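
The symmetry property, and its failure for the complementary log-log link, can be seen numerically. This is a small sketch using only the Python standard library; the probabilities checked are arbitrary illustrative values.

```python
import math
from statistics import NormalDist

def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return NormalDist().inv_cdf(p)         # inverse standard normal cdf

def cloglog(p):
    return math.log(-math.log(1.0 - p))    # complementary log-log link

probs = [0.05, 0.20, 0.50, 0.80]           # arbitrary illustrative probabilities

# For a symmetric link, link(p) + link(1 - p) is 0 at every p.
asymmetry = {
    name: max(abs(link(p) + link(1.0 - p)) for p in probs)
    for name, link in [("logit", logit), ("probit", probit), ("cloglog", cloglog)]
}
for name, value in asymmetry.items():
    print(f"{name:8s} max |link(p) + link(1 - p)| = {value:.4f}")
```

The logit and probit sums vanish up to rounding error, whereas the complementary log-log sum does not; even at p = 0.50 the link equals log(log 2), which is not 0.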

Figure 7.2 Binary regression model with complementary log-log link function.


To interpret model (7.3), we note that at $x_1$ and $x_2$,

$$\log[-\log(1 - \pi(x_2))] - \log[-\log(1 - \pi(x_1))] = \beta(x_2 - x_1),$$

so that

$$\frac{\log[1 - \pi(x_2)]}{\log[1 - \pi(x_1)]} = \exp[\beta(x_2 - x_1)]$$

and

$$1 - \pi(x_2) = [1 - \pi(x_1)]^{\exp[\beta(x_2 - x_1)]}.$$

For $x_2 - x_1 = 1$, the complement probability at $x_2$ equals the complement probability at $x_1$
raised to the $\exp(\beta)$ power. As x increases, the curve is monotone increasing when $\beta > 0$.
A related model to (7.3) is

$$\pi(x) = \exp[-\exp(\alpha + \beta x)]. \qquad (7.4)$$

In GLM form it uses the log-log link function,

$$\log[-\log(\pi(x))] = \alpha + \beta x.$$

For it, $\pi(x)$ approaches 0 sharply but approaches 1 slowly. As x increases, the curve is
monotone increasing when $\beta < 0$. When the log-log model holds for the probability of a
success, the complementary log-log model holds for the probability of a failure.

Model (7.4) with log-log link is the special case of (7.2) with cdf of the type I extreme
value (or Gumbel) distribution. The cdf equals

$$F(x) = \exp\{-\exp[-(x - a)/b]\}$$

for parameters $b > 0$ and $-\infty < a < \infty$. It has mode a, mean $a + 0.577b$, standard
deviation $\pi b/\sqrt{6} = 1.283b$, and is highly skewed to the right. The term extreme value refers
to this being the limit distribution of the maximum of a sequence of independent and
identically distributed continuous random variables. Models with log-log links can be fitted
using the Fisher scoring algorithm for GLMs.
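
As a check on the moments of the extreme value distribution, the mean and standard deviation can be approximated by integrating the quantile function over (0, 1). This is a rough numerical sketch, not an exact derivation; the midpoint rule and the grid size are choices made here for illustration, with a = 0 and b = 1.

```python
import math

def gumbel_quantile(u, a=0.0, b=1.0):
    """Inverse of F(x) = exp{-exp[-(x - a)/b]}."""
    return a - b * math.log(-math.log(u))

# Midpoint rule on (0, 1): E[g(X)] equals the integral of g(Q(u)) du.
grid_size = 200_000
first_moment = 0.0
second_moment = 0.0
for i in range(grid_size):
    q = gumbel_quantile((i + 0.5) / grid_size)
    first_moment += q
    second_moment += q * q
gumbel_mean = first_moment / grid_size
gumbel_sd = math.sqrt(second_moment / grid_size - gumbel_mean ** 2)

print(f"mean (a = 0, b = 1): {gumbel_mean:.4f}")
print(f"sd   (a = 0, b = 1): {gumbel_sd:.4f}")
print(f"pi / sqrt(6)       : {math.pi / math.sqrt(6.0):.4f}")
```

The approximate mean is close to Euler's constant (about 0.577) and the standard deviation is close to $\pi/\sqrt{6}$, in line with the values stated above.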

7.1.6 Example: Beetle Mortality Revisited

For the flour beetle mortality data (Table 7.1), the complementary log-log model has ML
estimates $\hat{\alpha} = -39.57$ and $\hat{\beta} = 22.04$. At dosage x = 1.70, the fitted probability of survival
is $1 - \hat{\pi}(x) = \exp\{-\exp[-39.57 + 22.04(1.70)]\} = 0.885$, whereas at x = 1.80 it is 0.330
and at x = 1.90 it is $4 \times 10^{-5}$. The probability of survival at dosage x + 0.10 equals
the probability at dosage x raised to the $\exp(22.04 \times 0.10) = 9.06$ power. For instance,
$0.330 = (0.885)^{9.06}$. Table 7.1 shows the fitted values and Figure 7.1 shows the fit. They
are close to the observed death counts ($G^2 = 3.45$, df = 6). The fit seems adequate.
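
These survival probabilities, and the power relationship between them, can be verified from the reported estimates. A minimal sketch, taking the rounded estimates as given:

```python
import math

alpha_hat, beta_hat = -39.57, 22.04       # complementary log-log estimates from the text

def survival(x):
    """Fitted probability of survival, 1 - pi_hat(x) = exp[-exp(alpha + beta*x)]."""
    return math.exp(-math.exp(alpha_hat + beta_hat * x))

s_170, s_180, s_190 = survival(1.70), survival(1.80), survival(1.90)
power = math.exp(beta_hat * 0.10)          # exponent for a 0.10-unit dose increase

print(f"survival at 1.70, 1.80, 1.90: {s_170:.3f}, {s_180:.3f}, {s_190:.1e}")
print(f"power for a 0.10 increase: {power:.2f}")
print(f"survival(1.70) ** power = {s_170 ** power:.3f}")
```

Raising the survival probability at x = 1.70 to the 9.06 power recovers the value at x = 1.80, as the model implies.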

The  models  with  different  link  functions  are  not  nested  so  cannot  be  compared  with 
standard likelihood-ratio tests. The AIC values are 41.3 for the logit link, 40.2 for the probit
model,  33.7  for  the  complementary  log-log  link,  and  57.8  for  the  log-log  link.  These  show 
a  clear  preference  for  the  complementary  log-log  link.  Aranda-Ordaz  (1981)  and  Stukel 
(1988)  proposed  generalized  link  functions  and  also  analyzed  these  data. 

7.2  BAYESIAN  INFERENCE  FOR  BINARY  REGRESSION 

Bayesian  modeling  of  binary  response  variables  provides  an  alternative  to  the  frequentist 
modeling  of  Chapters  5  and  6.  Our  main  focus  here  is  on  the  probit  and  logistic  regression 
models. 

7.2.1  Prior  Specifications  for  Binary  Regression  Models 

Models  can  have  many  parameters,  and  a  researcher  may  have  more  prior  information 
about some of them than others. One simplistic approach takes the prior distribution for $\beta$
to be constant over the multidimensional space of all possible parameter values. Then, the
posterior distribution is a constant multiple of the likelihood function. That is, the posterior
distribution is a scaling of the likelihood function so that it integrates out to 1. The mode
of  the  posterior  distribution  is  then  the  ML  estimate.  When  the  sample  size  is  small  or 
the  data  are  unevenly  distributed  among  the  categories,  the  posterior  distribution  may  be 
quite  skewed  rather  than  approximately  normal.  In  such  cases,  the  posterior  mean  can  be 
quite  different  from  the  posterior  mode  and  thus  from  the  ML  estimate. 

Effect parameters in binary regression models can take values over the entire real line.
Then, such a flat prior distribution is improper, not integrating out to 1 over the space of
possible parameter values.¹ A danger with improper prior distributions is that posterior
distributions  can  also  be  improper  for  some  models  (Natarajan  and  McCulloch  1995).  A 
Markov  chain  Monte  Carlo  (MCMC)  algorithm  for  approximating  the  posterior  distribution 
may  fail  to  recognize  that  the  posterior  distribution  is  improper.  Thus,  it  is  safer  to  use  a 
proper  but  relatively  diffuse  prior  if  you  prefer  the  prior  distribution  to  be  flat  relative  to  the 
likelihood  function. 

Considerable flexibility for a prior for $\beta$ is provided by a multivariate normal density.
A simple uninformative prior takes each mean to be 0, with a large standard deviation.
If you use a common $N(0, \sigma^2)$ prior for each parameter, it is sensible to standardize the
explanatory variables (e.g., with means of 0 and standard deviations of 1) so that the effects
are  comparable  in  interpretation.  Otherwise,  take  the  scale  into  account:  For  example,  if 


¹ For example, for a single binomial parameter $\pi$, an improper uniform density for $\text{logit}(\pi)$ corresponds to an
improper beta prior for $\pi$ with $\alpha_1 = \alpha_2 = 0$.



$x$ = time is rescaled from years to months, the new parameter is 1/12th as large, so $\sigma$ in the
normal prior should be multiplied by 1/12 compared to when $x$ is measured in years.

Using large $\sigma$ in normal priors for $\beta$ implies priors on the probability scale that are
highly U-shaped, with about half the probability very close to 0 and half very close to 1.
This seems intuitively to be rather informative, but in fact such priors have little influence,
with the posterior looking much like the likelihood function. You could instead select $\sigma$ so
that the prior on an induced probability scale is close to uniform. With a single parameter,
this is true except near the boundaries when $\sigma \approx 1.5$; using $\sigma = 1.69$ matches the normal
to a uniform prior in the first two moments.
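The shape of the prior induced on the probability scale can be checked numerically. The following sketch assumes a single parameter with $\text{logit}(\pi) \sim N(0, \sigma^2)$; the cutoff of 0.01 for "very close" to a boundary, the integration settings, and the function names are illustrative choices rather than anything specified in the text.

```python
import math

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prior_mass_near_boundaries(sigma, eps=0.01):
    """P(pi < eps or pi > 1 - eps) when logit(pi) ~ N(0, sigma^2)."""
    cutoff = math.log((1.0 - eps) / eps)          # logit(1 - eps)
    return 2.0 * (1.0 - normal_cdf(cutoff / sigma))

def induced_variance(sigma, half_width=12.0, steps=20000):
    """Variance of pi = expit(beta), beta ~ N(0, sigma^2), by the midpoint rule.

    The induced prior is symmetric about 0.5, so its mean is 0.5.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        z = -half_width + (k + 0.5) * h           # standardized value beta / sigma
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        pi = 1.0 / (1.0 + math.exp(-sigma * z))
        total += (pi - 0.5) ** 2 * density * h
    return total

for sigma in (10.0, 1.69):
    print(sigma, round(prior_mass_near_boundaries(sigma), 3), round(induced_variance(sigma), 4))
print("uniform variance:", round(1.0 / 12.0, 4))
```

With $\sigma = 10$ most of the prior mass sits within 0.01 of a boundary, whereas with $\sigma = 1.69$ the induced variance is close to the value 1/12 of a uniform distribution.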

Some  data  analysts  prefer  a  subjective  Bayesian  approach  whereby  prior  distributions 
represent prior beliefs about $\beta$. For example, instead of using $\mu = 0$ and a very large $\sigma$ for
a $N(\mu, \sigma^2)$ prior distribution, you could take $\mu$ and $\sigma$ such that $\mu \pm 3\sigma$ contains all values
that  have  any  plausibility  for  the  parameter.  If  appropriate,  you  can  also  include  correlation 
in  the  prior  distribution  between  different  parameters. 
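One simple way to turn a plausible range into normal hyperparameters is to treat the endpoints as $\mu \pm 3\sigma$. The helper below is a hypothetical illustration of that rule; the example range for the odds ratio (0.05 to 20) is an assumption chosen for the demonstration.

```python
import math

def normal_prior_from_range(lower, upper):
    """Return (mu, sigma) such that mu - 3*sigma = lower and mu + 3*sigma = upper."""
    mu = (lower + upper) / 2.0
    sigma = (upper - lower) / 6.0
    return mu, sigma

# Hypothetical belief: the conditional odds ratio lies between 0.05 and 20,
# so the log odds ratio lies between log(0.05) and log(20).
mu, sigma = normal_prior_from_range(math.log(0.05), math.log(20.0))
print(round(mu, 3), round(sigma, 3))
```

For this range the rule gives a prior centered at 0 with a standard deviation very close to 1 on the log odds ratio scale.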

In practice, it is not obvious how to specify the hyperparameters for normal prior distributions
for $\beta$. Data analysts think more easily in terms of plausible values for probabilities
rather  than  for  model  parameters  that  pertain  to  a  nonlinear  function  of  the  probabilities 
such  as  effects  on  the  log  odds.  Alternatively,  you  can  construct  a  prior  distribution  on 
the  probability  scale  rather  than  a  link  function  scale  such  as  the  logit,  as  we’ll  explain  in 
Section  7.2.4.  Or,  many  Bayesians  prefer  to  use  the  Jeffreys  prior,  because  of  its  invariance 
to  the  parameterization  and  other  desirable  properties.  This  prior  density  function  relates 
to the information matrix $\mathcal{J}$, being proportional to $|\mathcal{J}|^{1/2}$. For binomial regression models,
Ibrahim  and  Laud  (1991)  and  Chen  et  al.  (2008)  showed  that  the  Jeffreys  prior  is  proper. 
With  logit  and  probit  link  functions,  this  prior  is  symmetric  and  unimodal  at  0.  It  and  the 
corresponding  posterior  have  thinner  tails  than  any  multivariate  t  distribution,  and  this  holds 
also  for  the  complementary  log-log  link. 

The  next  example  illustrates  the  potential  impact  of  the  choice  of  prior  distribution.  At 
this  stage,  we  will  not  worry  about  the  technical  details  of  how  to  approximate  the  posterior 
distribution  computationally,  leaving  this  to  Section  7.2.6. 

7.2.2  Example:  Risk  Factors  for  Endometrial  Cancer  Grade 

Heinze  and  Schemper  (2002)  described  a  study  about  endometrial  cancer  in  which  the 
purpose  was  to  describe  y  =  histology  of  79  cases  (0  =  low  grade  for  30  patients,  1  =  high 
grade for 49 patients) in terms of three supposed risk factors: $x_1$ = neovasculation (1 =
present for 13 patients, 0 = absent for 66 patients), $x_2$ = pulsatility index of arteria uterina
(ranging from 0 to 49), and $x_3$ = endometrium height (ranging from 0.27 to 3.61). Table 7.2
shows  some  of  the  data.  The  complete  data  set  is  available  at  the  text  web  site. 


Table 7.2  Part of Endometrial Cancer Data Set (a)

HG   NV   PI   EH       HG   NV   PI   EH       HG   NV   PI   EH
0    0    13   1.64     0    0    16   2.26     0    0    8    3.14
1    1    21   0.98     1    0    5    0.35     1    1    19   1.02

(a) HG = histology grade, NV = neovasculation, PI = pulsatility index, EH = endometrium height.

Source: Data courtesy of Michael Schemper and Georg Heinze. Complete data (n = 79) at
www.stat.ufl.edu/~aa/cda/cda.html.

For these data, we consider the main-effects model


$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,$$


using standardized versions of $x_2$ and $x_3$. For all 13 patients having $x_1 = 1$, the outcome
is $y = 1$. There is quasi-complete separation, and the ML estimate $\hat{\beta}_1 = \infty$. The 95%
profile likelihood confidence interval for $\beta_1$ is $(1.28, \infty)$. Apparently, neovasculation is an
important risk factor, so it is not sensible to drop it from the model because of its infinite
estimate. With the Bayesian approach, the estimate of $\beta_1$ is finite.

In our Bayesian analyses, we use independent $N(\mu, \sigma^2)$ prior distributions for the model
parameters. To reflect a lack of prior belief about the direction of the effects, we took each
$\mu = 0.0$. Instead of the usual (0, 1) coding for the indicator variable $x_1$, we let it take values
-0.5 and 0.5. The prior distribution is then symmetric in the sense that the logits for each
group have the same prior variability as well as the same prior means, yet $\beta_1$ still has the
usual interpretation of a conditional log odds ratio.

For these data, because the log likelihood is so flat in the $\beta_1$ dimension, posterior means
for $\beta_1$ can be quite different for different prior distributions. To reflect a lack of information
about the sizes of the effects, we first took the prior distributions to be quite diffuse, with
$\sigma = 10$. The analysis can be implemented with Bayesian software such as WinBUGS or
ordinary software that has a Bayes option (such as SAS PROC GENMOD, as shown at the
text website), using an MCMC algorithm to approximate the posterior. Table 7.3 shows
posterior means, standard deviations, and 95% equal-tail posterior intervals, based on an
MCMC process with 1,000,000 iterations. Chains were run with various starting values,
and gave similar results. With such a long process, the Monte Carlo standard errors for the
approximations to the Bayes estimates were negligible (about 0.01). Table 7.3 also shows
the  ML  results,  for  comparison. 
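The effect of separation, and of adding a proper prior, can be illustrated with a deliberately simplified sketch. It keeps only $x_1$ (with the usual 0/1 coding), uses the counts implied by the text (all 13 patients with $x_1 = 1$ have $y = 1$, so 36 of the 66 patients with $x_1 = 0$ do), and finds a posterior mode on a grid instead of running MCMC. The grid limits and function names are assumptions, and because the sketch omits $x_2$ and $x_3$ and reports a mode rather than a mean, its numbers are not comparable to those in Table 7.3.

```python
import math

def log_expit(x):
    """log of 1/(1 + exp(-x)), computed stably for moderate x."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def log_likelihood(alpha, beta1):
    """Binomial log likelihood for the simplified model logit P(y=1) = alpha + beta1*x1."""
    ll = 36 * log_expit(alpha) + 30 * log_expit(-alpha)   # 66 patients with x1 = 0
    ll += 13 * log_expit(alpha + beta1)                   # 13 patients with x1 = 1, all y = 1
    return ll

def log_posterior(alpha, beta1, sigma):
    """Log likelihood plus independent N(0, sigma^2) log priors (up to a constant)."""
    return log_likelihood(alpha, beta1) - (alpha ** 2 + beta1 ** 2) / (2.0 * sigma ** 2)

alphas = [-2.0 + 0.02 * i for i in range(201)]    # grid for alpha on [-2, 2]
betas = [0.05 * j for j in range(401)]            # grid for beta1 on [0, 20]

def grid_argmax(objective):
    """Return the (alpha, beta1) grid point that maximizes objective."""
    return max(((a, b) for a in alphas for b in betas), key=lambda ab: objective(*ab))

ml_alpha, ml_beta1 = grid_argmax(log_likelihood)
map_alpha, map_beta1 = grid_argmax(lambda a, b: log_posterior(a, b, sigma=10.0))
print("likelihood maximized at the grid edge:", ml_beta1 == betas[-1])
print("posterior mode for beta1 (sigma = 10):", round(map_beta1, 2))
```

However far the grid is extended, the likelihood keeps increasing in $\beta_1$, so its maximizer sits at the edge; the normal prior adds a quadratic penalty that produces an interior, finite mode.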

Consider $\beta_1$, for which the ML estimate is infinite. Based on the posterior mean, the
estimated odds of the higher grade histology when neovasculation is present are exp(8.93) =
7555 times the estimated odds when neovasculation is absent. The 95% equal-tail posterior
interval for $\beta_1$ is (2.11, 20.14). This provides the inference that $\beta_1 > 0$ and the effect seems
to  be  large.  The  estimated  size  of  the  effect  is  imprecise,  because  of  the  flat  log  likelihood 
and  the  relatively  disperse  priors.  Inferences  about  the  model  parameters  were  substantively 
the  same  as  with  the  ML  frequentist  analysis. 

For  further  comparison,  we  used  more  informative  prior  distributions.  To  reflect  a 
stronger belief that the effects are not extremely strong, we took the prior standard deviations
to be 1.0. Then nearly all the prior probability mass for the conditional odds ratio
$\exp(\beta_1)$ falls between exp(-3.0) = 0.05 and exp(3.0) = 20. Results were quite different
from those with the ML frequentist analysis or the Bayesian analysis with $\sigma = 10$. The posterior


Table 7.3  Results of Fitting Models to Cancer Data Set of Table 7.2 (a)

Analysis              $\hat{\beta}_1$   SD     Interval       $\hat{\beta}_2$   SD     Interval       $\hat{\beta}_3$   SD     Interval
ML                    $\infty$          --     (1.3, $\infty$)   -0.42          0.44   (-1.4, 0.4)    -1.92          0.56   (-3.2, -1.0)
Bayes, $\sigma = 10$  9.12              5.10   (2.1, 21.3)    -0.47             0.45   (-1.4, 0.4)    -2.14          0.59   (-3.4, -1.1)
Bayes, $\sigma = 1$   1.65              0.69   (0.3, 3.0)     -0.22             0.33   (-0.9, 0.4)    -1.77          0.43   (-2.7, -1.0)

(a) Interval is profile likelihood interval for ML and equal-tail posterior interval for Bayes.

mean for $\beta_1$ is now 1.65 instead of 8.93. Because $y = 1$ for all 13 patients having $x_1 = 1$,
the frequentist approach tells us we cannot rule out any very large value for $\beta_1$. By contrast,
if we had strong prior beliefs that $|\beta_1| < 3$, then even with these sample results the Bayesian
posterior inference has an upper bound of about 3 for $\beta_1$.

Corresponding to the frequentist P-value for $H_a\colon \beta_1 > 0$, the Bayesian approach provides
the posterior probability that $\beta_1 < 0$. This is 0.002 for the Bayesian analysis with $\sigma = 10$;
that is, 0.0 is the 0.2 percentile of the posterior distribution. For this relatively flat prior
distribution, this posterior tail probability is similar to the P-value of 0.001 for the one-sided
frequentist likelihood-ratio test of $H_0\colon \beta_1 = 0$ against $H_a\colon \beta_1 > 0$, thus giving very
strong evidence that $\beta_1 > 0$. The posterior $P(\beta_1 < 0) = 0.007$ for the informative prior
with $\sigma = 1.0$. With the more informative prior distribution centered at a lack of a treatment
effect, this posterior probability provides a bit less evidence of a treatment effect.

Similar  substantive  results  occur  with  corresponding  probit  models.  For  comparable 
results, prior $\sigma$ values should be divided by about 1.6 to 1.8 to reflect the smaller variability
on  the  probit  link  scale  compared  with  the  logit  link. 

7.2.3  Bayesian  Logistic  Regression  for  Retrospective  Studies 

The  frequentist  ML  equivalence  between  prospective  and  retrospective  logistic  models 
has  analogs  for  Bayesian  methods.  A  key  reference  is  Seaman  and  Richardson  (2004). 
Their  retrospective  likelihood  was  combined  with  a  Dirichlet  prior  distribution  on  exposure 
probabilities  (for  a  discrete  exposure  variable  or  set  of  variables)  in  the  control  group. 
Their  prospective  likelihood  was  combined  with  an  improper  uniform  prior  distribution 
for  the  log  odds  that  an  individual  with  baseline  exposure  is  diseased.  They  showed  that 
the  posterior  distribution  of  log  odds  ratios  is  equivalent  for  the  two  approaches.  Ghosh 
and Mukherjee (2010) surveyed Bayesian work on case-control studies. Topics of interest
include measurement error, handling missingness, and flexibility for hierarchical
structures. 

7.2.4  Probability-Based  Prior  Specifications  for  Binary  Regression  Models 

As  an  alternative  to  relying  on  normal  prior  distributions  on  a  link  function  scale  such 
as  the  logit  or  probit,  you  can  construct  a  prior  distribution  on  the  probability  scale.  The 
chosen  prior  distribution  then  induces  a  corresponding  prior  distribution  for  the  model 
parameters.  Moreover,  prior  distributions  for  models  with  different  link  functions  are  then 
comparable,  even  though  they  are  not  identical  on  the  scale  of  the  parameters  in  the  linear 
predictor. 

This  approach  requires  selecting  prior  distributions  for  at  least  as  many  probability 
values  as  there  are  parameters  in  the  model.  Suppose  we  choose  M  settings  of  explanatory 
variable  values  for  placing  prior  distributions  on  the  response  probabilities.  At  setting  s, 
denoted by $x_{(s)}$ with $P(Y = 1) = \gamma_{(s)}$ at that point, we select a prior distribution (such as a
beta distribution) for $\gamma_{(s)}$. One way to indirectly determine the hyperparameters for a beta
distribution is to guess two relevant values of the distribution, such as its mean and its
standard deviation or a percentile such as the 95th percentile. Those values then determine
the beta indices. The sum of the two beta indices for $\gamma_{(s)}$ corresponds to a particular
number $K_s$ of “prior observations” that the prior belief represents. That is, at setting $s$ the
beta prior density for $\gamma_{(s)}$ has parameters $K_s g_s$ and $K_s(1 - g_s)$, corresponding to mean
$g_s$. Alternatively, we might specify the prior information using $g_s$ and $K_s$. For example,
we might indicate that our guess for $\gamma_{(s)}$ is $g_s = 0.30$ and that this prior information is
relatively vague, being comparable to $K_s = 2$ prior observations, in which case the prior
beta hyperparameters are 0.60 and 1.40. This type of prior is sometimes referred to as a
data augmentation prior (Christensen et al. 2010, Sec. 8.4).
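The two ways of specifying a beta prior described above can be written as small helper functions. This is a sketch with hypothetical function names; the mean and standard deviation route relies on the standard beta variance formula, which is an addition here rather than something derived in the text.

```python
import math

def beta_from_mean_and_size(g, K):
    """Beta indices for prior mean g and K 'prior observations'."""
    return K * g, K * (1.0 - g)

def beta_from_mean_and_sd(mean, sd):
    """Beta indices implied by a guessed mean and standard deviation.

    Uses var = mean*(1 - mean)/(K + 1), where K is the sum of the two indices.
    """
    K = mean * (1.0 - mean) / sd ** 2 - 1.0
    if K <= 0:
        raise ValueError("sd is too large for a beta distribution with this mean")
    return beta_from_mean_and_size(mean, K)

# The text's example: g = 0.30 with K = 2 prior observations.
print(beta_from_mean_and_size(0.30, 2))
# The same prior recovered from its mean and standard deviation.
print(beta_from_mean_and_sd(0.30, math.sqrt(0.07)))
```

Both calls describe the same prior, with indices 0.60 and 1.40 up to floating-point rounding.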

For  simplicity,  we  treat  these  M  prior  distributions  for  the  probabilities  as  independent. 
Then,  the  joint  prior  density  function  in  terms  of  these  M  probabilities  is 

$$g(\gamma_{(1)}, \ldots, \gamma_{(M)}) \propto \prod_{s=1}^{M} \gamma_{(s)}^{K_s g_s - 1} (1 - \gamma_{(s)})^{K_s(1 - g_s) - 1}.$$


Suppose the link function for the model corresponds to the inverse of the cdf $\Phi$, such as
standard normal for the probit link and standard logistic for the logit link. Let $\phi$ denote the
corresponding pdf. Then, in terms of the model parameters $\beta^T = (\alpha, \beta_1, \beta_2, \ldots)$, this prior
density function corresponds to

$$g(\beta) \propto \prod_{s} \left\{ \Phi(\beta^T x_{(s)})^{K_s g_s - 1} \left[1 - \Phi(\beta^T x_{(s)})\right]^{K_s(1 - g_s) - 1} \times \phi(\beta^T x_{(s)}) \right\}.$$


7.2.5  Example:  Modeling  the  Probability  a  Trauma  Patient  Survives 

Bedrick  et  al.  (1997)  and  Christensen  et  al.  (2010,  Chap.  8)  illustrated  this  approach  using 
data  from  300  patients  admitted  to  the  University  of  New  Mexico  Trauma  Center  between 
1991  and  1994.  The  response  variable  Y  was  whether  the  patient  died  (1  =  yes,  0  =  no). 
The explanatory variables were $x_1$ = injury severity score (taking values between 0 and 75),
$x_2$ = trauma score based on a weighted average of several measurements such as systolic
blood pressure and respiratory rate (taking values from 0 for no vital signs to 7.84 for
normal vital signs), $x_3$ = age, and $x_4$ = type of injury (1 = penetrating, such as a gunshot
or knife wound; 0 = blunt, such as a car crash). The authors used a model that permitted
the effect of type of injury to vary by age,

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 (x_3 x_4).$$


The data are available at the website for the Christensen et al. (2010) text.² Of the 225
patients  with  blunt  injuries,  17  died,  whereas  of  the  75  patients  with  penetrating  injuries, 
5  died. 

To  help  in  selecting  prior  distributions,  the  authors  elicited  percentile  values  for 
P(Y  =  1)  from  the  trauma  surgeon  who  provided  the  data,  at  six  locations  for  settings  of 
the explanatory variables. For example, the first location was $x_1 = 25$, $x_2 = 7.84$, $x_3 = 60$,
and $x_4 = 0$, representing a person with normal vital signs who was not badly hurt. There,
the chosen prior was beta(1.1, 8.5), which has mean 0.11 and standard deviation 0.10.
By contrast, the third of the six locations, with $x_1 = 41$, $x_2 = 3.34$, $x_3 = 60$, and $x_4 = 1$,
had  a  considerably  more  severe  injury  score  and  poorer  trauma  score.  The  beta(5.9,  1.7) 
prior  chosen  there  has  mean  0.78  and  standard  deviation  0.14.  The  six  priors  are  highly 
informative  (perhaps  too  much  so),  corresponding  to  adding  57.5  observations  to  the  data 
when  regarded  as  data  augmentation  priors. 
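The quoted means and standard deviations follow directly from the beta indices, and the sum of the two indices is the number of "prior observations" that each prior contributes. A minimal check, using only the two priors quoted above (the function name is hypothetical):

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a beta(a, b) distribution."""
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean, sd

# The two elicited priors quoted in the text.
for a, b in [(1.1, 8.5), (5.9, 1.7)]:
    mean, sd = beta_mean_sd(a, b)
    print(f"beta({a}, {b}): mean {mean:.2f}, sd {sd:.2f}, prior observations {a + b:.1f}")
```

These two priors account for 9.6 and 7.6 of the 57.5 prior observations; the remaining four priors are not listed here, so the total cannot be reproduced from this excerpt alone.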

² As of 2012, see www.ics.uci.edu/~wjohnson/BIDA/Ch8/trauma300.txt.

These six prior distributions induce prior distributions for the logistic model parameters.
With  data  augmentation  priors  such  as  these  beta  distributions,  the  posterior  distribution 
has  the  shape  of  the  likelihood  function  for  the  augmented  data  set.  So,  we  can  use  standard 
frequentist  software  to  find  the  posterior  mode  by  finding  the  ML  estimate  for  the  augmented 
data. 
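For the logit link, the reason the posterior has the shape of an augmented-data likelihood can be verified numerically: the standard logistic pdf equals $F(1 - F)$, so it cancels the "$-1$" in each exponent of the induced prior. The sketch below compares the two expressions at an arbitrary parameter value. The settings, parameter values, and function names are hypothetical and chosen only for illustration.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def log_induced_prior(beta, settings):
    """Log of the induced prior density g(beta) for the logit link, up to a constant.

    settings is a list of (x, K, g): covariate vector, prior sample size, prior mean.
    """
    total = 0.0
    for x, K, g in settings:
        F = expit(sum(b * xj for b, xj in zip(beta, x)))   # cdf at the linear predictor
        f = F * (1.0 - F)                                  # standard logistic pdf
        total += ((K * g - 1.0) * math.log(F)
                  + (K * (1.0 - g) - 1.0) * math.log(1.0 - F)
                  + math.log(f))
    return total

def log_likelihood_prior_observations(beta, settings):
    """Binomial log likelihood treating K*g 'successes' and K*(1-g) 'failures' as data."""
    total = 0.0
    for x, K, g in settings:
        F = expit(sum(b * xj for b, xj in zip(beta, x)))
        total += K * g * math.log(F) + K * (1.0 - g) * math.log(1.0 - F)
    return total

# Hypothetical settings for a model with an intercept and one predictor.
settings = [((1.0, 0.0), 2.0, 0.30), ((1.0, 1.0), 4.0, 0.60)]
beta = (-0.5, 1.2)
print(log_induced_prior(beta, settings), log_likelihood_prior_observations(beta, settings))
```

The two values agree, so adding the log prior to the log likelihood of the observed data is the same as computing a log likelihood for the observed data plus fractional "prior observations" at the chosen settings.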

For  the  chosen  prior  distributions,  Table  7.4  lists  the  posterior  means  and  standard 
deviations  of  the  logistic  model  parameters,  as  well  as  the  corresponding  ML  estimates 
and  their  standard  errors.  The  Bayes  estimates  differ  somewhat  in  magnitude  from  the  ML 
estimates,  reflecting  the  informative  prior  distributions.  However,  substantive  conclusions 
are similar. There is some indication that injury type has more of an effect at younger ages
(e.g., the Bayes estimate of the injury type effect on the log odds is 0.9 at age 10 and about 0
at age 55), although the interaction is not statistically significant.

This  approach  is  appealing  as  a  way  of  eliciting  subjective  priors  that  are  interpretable 
on the probability scale. Some would find the resulting overall prior for the parameters
too highly informative. For comparison, Table 7.4 also shows results from using relatively
flat independent normal priors (each with $\sigma = 10$). These are more similar to those using
ML.  Alternatively,  we  could  use  the  Bedrick  et  al.  (1997)  approach  of  setting  priors  on  the 
probability  scale  to  induce  those  for  the  logistic  parameters,  but  select  those  probability 
priors  to  be  less  informative. 

With the beta priors, Bedrick et al. (1997) reported that the posterior $P(\beta_1 < 0) < 0.01$.
This corresponds to a frequentist P-value for testing $H_0\colon \beta_1 = 0$ against $H_a\colon \beta_1 > 0$.
Not  surprisingly,  there  is  strong  evidence  that  higher  injury  scores  correspond  to  higher 
probabilities  of  non-survival. 

At  any  particular  setting  of  the  explanatory  variables,  the  posterior  predictive  value  of 
P(Y  =  1)  is  found  by  integrating  the  logistic  expression  for  this  probability  with  respect 
to  the  posterior  distribution  for  the  model  parameters.  This  gives  a  Bayesian  posterior 
estimate  of  the  probability  of  death  at  that  setting.  Bedrick  et  al.  (1997)  regarded  this  as 
more  important  than  estimating  the  model  parameters.  Such  estimates  can  be  portrayed 
graphically  as  a  way  of  describing  effects.  For  example,  Figure  7.3  graphs  the  Bayes 
estimate  of  the  probability  of  death  as  a  function  of  the  injury  severity  score  (ISS),  for 
each  type  of  injury,  for  subjects  with  a  trauma  score  (Rts)  of  3.34  and  ages  of  10  and 
60.  This  portrays  how  injury  type  has  more  of  an  effect  at  a  younger  age.  Integrating  the 
logistic  curve  by  the  posterior  distribution  yields  a  curve  that  does  not  have  exactly  the 


Table 7.4  Bayesian and ML Fit of Logistic Regression Model for Trauma Data

                      Bayesian, Beta Priors    Bayesian, Normal Priors    Frequentist ML
Variable              Mean      Std. dev.      Mean      Std. dev.        Estimate    SE
Intercept             -1.79     1.10           -2.02     1.57             -2.061      1.526
Injury score           0.07     0.02            0.09     0.03              0.083      0.028
Trauma score          -0.60     0.14           -0.60     0.18             -0.553      0.171
Age                    0.05     0.01            0.05     0.02              0.051      0.017
Injury type            1.10     1.06            1.44     1.41              1.338      1.334
Age × Injury type     -0.02     0.03           -0.01     0.03             -0.005      0.032

Source: Results with beta priors based on Table 2 in Bedrick et al. (1997).

[Figure 7.3 appears here: two panels, titled "Age = 60 and Rts = 3.34" and "Age = 10 and Rts = 3.34"; horizontal axis: ISS.]

Figure 7.3  Bayesian estimate of probability of dying as function of injury severity score and injury type, for
subjects with trauma score = 3.34 and ages = 10 and 60. Source: Reprinted with permission from Christensen
et al. (2010, p. 191).


logistic formula, but has similar appearance. As in a frequentist analysis, it is also possible
to provide interval estimates for $P(Y = 1)$.
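Given posterior draws of the parameters, the posterior predictive probability is simply the average of the logistic probabilities over the draws. The sketch below contrasts that average with the plug-in value at the posterior mean. The draws are made-up numbers standing in for MCMC output, and the function names are hypothetical.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def posterior_predictive_probability(beta_draws, x):
    """Average the logistic probability over posterior draws of beta."""
    probs = [expit(sum(b * xj for b, xj in zip(beta, x))) for beta in beta_draws]
    return sum(probs) / len(probs)

def plug_in_probability(beta_draws, x):
    """Logistic probability evaluated at the posterior mean of beta."""
    n = len(beta_draws)
    beta_mean = [sum(draw[j] for draw in beta_draws) / n for j in range(len(x))]
    return expit(sum(b * xj for b, xj in zip(beta_mean, x)))

# Hypothetical posterior draws of (intercept, slope); real draws would come from MCMC.
draws = [(-1.0, 0.5), (0.0, 1.5), (1.0, 2.5), (0.5, 3.5)]
x = (1.0, 1.0)
print(posterior_predictive_probability(draws, x), plug_in_probability(draws, x))
```

The two numbers differ because the logistic function is nonlinear, which is also why the averaged curve does not follow the logistic formula exactly.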

Bedrick et al. (1997) studied the sensitivity of the results to deleting individual observations
and to changes in the prior distribution. One way to summarize this is in terms
of changes in the predictive value of $P(Y = 1)$ at the various explanatory settings for the
observations. They also investigated link selection, by comparing models in terms of a
Bayes factor (BF). The BF is formed as

$$BF = p(y \mid M_1) / p(y \mid M_2),$$

where $p(y \mid M)$ is the probability of the observed data under model $M$. For a given model
$M$, $p(y \mid M)$ is obtained by integrating the likelihood function for that model with respect
to the induced prior on $\beta$ for that model, thus giving a marginal likelihood. For these data,
this Bayes factor was about 1.0 when comparing models with logit and probit link, but
about  21  when  comparing  each  of  these  to  the  model  with  complementary  log-log  link. 
For  example,  for  the  chosen  prior  distributions,  the  probability  of  the  observed  data  under 
the  logistic  model  was  20.7  times  the  probability  of  the  same  data  with  the  complementary 
log-log  link. 
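A marginal likelihood, and hence a Bayes factor, can be computed by direct numerical integration when the model has a single parameter. The toy sketch below is not the trauma analysis: it assumes an intercept-only model, hypothetical data (3 successes in 10 trials), and for each link a prior on the intercept equal to the link's own underlying density, so that both models induce a uniform prior on the probability scale.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z):
    F = logistic_cdf(z)
    return F * (1.0 - F)

def marginal_likelihood(y, n, cdf, prior_pdf, half_width, steps=40000):
    """Integrate likelihood * prior over the intercept beta by the midpoint rule.

    The likelihood is for one particular sequence of y successes in n Bernoulli trials.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        beta = -half_width + (k + 0.5) * h
        p = cdf(beta)
        total += p ** y * (1.0 - p) ** (n - y) * prior_pdf(beta) * h
    return total

y, n = 3, 10   # hypothetical data
m_logit = marginal_likelihood(y, n, logistic_cdf, logistic_pdf, half_width=40.0)
m_probit = marginal_likelihood(y, n, normal_cdf, normal_pdf, half_width=10.0)
print("Bayes factor, logit vs probit:", m_logit / m_probit)
```

Because the two induced priors on the probability are identical here and the model has only an intercept, the Bayes factor is 1 by construction. With explanatory variables the links imply different fitted probabilities, and the Bayes factor can then favor one link, as in the comparison with the complementary log-log link reported above.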

7.2.6  Bayesian  Fitting  for  Probit  Models 

We  now  summarize  the  basic  ideas  of  Bayesian  model  fitting  of  binary  regression  models 
with  normal  priors,  in  the  context  of  probit  models.  Albert  and  Chib  (1993)  showed  that 
a  simple  analysis  is  possible  in  the  probit  case  using  Gibbs  sampling  based  on  the  normal 
threshold  latent  variable  model  presented  in  Section  7.1.1.  This  model  is  simpler  to  handle 
than  the  logistic  model,  because  results  apply  from  Bayesian  inference  for  ordinary  normal 
linear  regression  models.  Albert  and  Chib  assumed  a  multivariate  normal  prior  distribution 
for  the  regression  parameters  and  independent  normal  latent  variables.  Then,  the  posterior 
distribution  of  the  regression  parameters,  conditional  on  the  observed  data  and  the  latent 
variables,  is  multivariate  normal.  Implementation  of  MCMC  methods  is  relatively  simple 
because  the  Monte  Carlo  sampling  is  from  a  normal  distribution. 

For subject $i$, a latent variable $y_i^*$ is assumed to relate to the response $y_i$ by $y_i = 1$ if
$y_i^* > 0$ and $y_i = 0$ if $y_i^* \le 0$. (We use the data in this section in ungrouped form, so that all
$n_i = 1$.) We assume that $Y_i^*$ has a $N(\beta^T x_i, 1)$ distribution, where $x_i = (1, x_{i1}, \ldots, x_{ip})^T$.
Thus,

$$P(Y_i = 1) = P(Y_i^* > 0) = P(\beta^T x_i + \epsilon > 0),$$

where $\epsilon$ is a $N(0, 1)$ random variable. By the symmetry of the standard normal distribution,
$P(\epsilon > -\beta^T x_i) = \Phi(\beta^T x_i)$ for the standard normal cdf $\Phi$, so this corresponds
to the probit model

$$\Phi^{-1}[P(Y_i = 1)] = \beta^T x_i.$$

Albert and Chib noted that if $\{y_i^*\}$ were observed and a multivariate normal prior were
chosen for $\beta$, then the posterior distribution for $\beta$ results from ordinary normal linear
model results. Given the binary response $y_i$, however, $y_i^*$ is left or right truncated at 0.
Thus, its distribution is a truncated normal distribution, but it is still possible to use
Gibbs sampling to simulate the exact posterior distribution.

The likelihood function can be constructed in terms of the model for the underlying latent
observation, $y_i^*$. If $y_i^*$ were observed, the contribution to the likelihood function would be
$\phi(y_i^* - \beta^T x_i)$, where $\phi(\cdot)$ is the standard normal pdf. Now, regarding $y_i^*$ as unknown except
for its sign, the contribution to the likelihood function is

$$\left[I(y_i^* > 0)^{y_i} I(y_i^* \le 0)^{1 - y_i}\right] \phi(y_i^* - \beta^T x_i),$$

where $I$ is the indicator function. For $n$ independent observations, the likelihood function is
proportional to the product of $n$ such terms. Then, for prior density function $g(\beta)$, the joint
posterior density of $\beta$ and of $\{y_i^*\}$ given the data $\{y_i\}$ is proportional to

$$g(\beta) \prod_{i} \left[I(y_i^* > 0)^{y_i} I(y_i^* \le 0)^{1 - y_i}\right] \phi(y_i^* - \beta^T x_i).$$

With the ML estimates as initial values, Albert and Chib used a Gibbs sampling scheme
that successively samples from the density of $y^* = (y_1^*, \ldots, y_n^*)^T$ given $\beta$ and of $\beta$ given
$y^*$. With the conjugate normal prior, they noted that the posterior density of $\beta$ given $y^*$ is
normal. Specifically, suppose that the prior distribution of $\beta$ is $N(\beta_0, \Sigma_0)$, and let $X$ be the
matrix with $i$th row $x_i^T$, so the latent variable model is $y^* = X\beta + \epsilon$. Conditional on $y^*$,
the distribution of $\beta$ is $N(\tilde{\beta}, \tilde{\Sigma})$ with

$$\tilde{\beta} = (\Sigma_0^{-1} + X^T X)^{-1} (\Sigma_0^{-1} \beta_0 + X^T y^*), \qquad \tilde{\Sigma} = (\Sigma_0^{-1} + X^T X)^{-1}.$$

Conditional on $\beta$, the elements of $y^*$ are independent with density of $y_i^*$ being $N(\beta^T x_i, 1)$
truncated at the left by 0 if $y_i = 1$ and truncated at the right by 0 if $y_i = 0$. The model
fitting  yields  posterior  means  for  the  Bayes  estimates  of  parameters,  and  posterior  standard 
deviations  that  describe  the  precisions  of  those  estimates. 
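The two-step scheme can be sketched in a few lines for the simplest special case, an intercept-only probit model, in which $X$ is a column of ones and the matrix formulas reduce to scalars. This is an illustrative sketch rather than Albert and Chib's implementation: the data, prior settings, chain length, starting value, and function names are assumptions, and the truncated normal draws use a basic inverse-cdf method.

```python
import math
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()

def draw_truncated_normal(mu, positive, rng):
    """Draw from N(mu, 1) truncated to (0, inf) if positive, else to (-inf, 0].

    Inverse-cdf method; adequate when |mu| is moderate, as in this toy example.
    """
    p_below_zero = STD_NORMAL.cdf(-mu)                # P(N(mu, 1) <= 0)
    u = rng.random()
    p = p_below_zero + u * (1.0 - p_below_zero) if positive else u * p_below_zero
    p = min(max(p, 1e-12), 1.0 - 1e-12)               # keep inv_cdf's argument inside (0, 1)
    return mu + STD_NORMAL.inv_cdf(p)

def gibbs_probit_intercept(y, prior_mean=0.0, prior_sd=10.0, n_iter=3000, burn_in=500, seed=1):
    """Gibbs sampler for the intercept-only probit model Phi^{-1}[P(Y=1)] = beta."""
    rng = random.Random(seed)
    n = len(y)
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (prior_precision + n)            # (Sigma0^{-1} + X'X)^{-1}, X a column of 1s
    beta = 0.0                                        # simple start; the text starts at the ML estimate
    draws = []
    for iteration in range(n_iter):
        # Step 1: latent y* given beta (independent truncated normals).
        y_star = [draw_truncated_normal(beta, yi == 1, rng) for yi in y]
        # Step 2: beta given y* (normal).
        post_mean = post_var * (prior_precision * prior_mean + sum(y_star))
        beta = rng.gauss(post_mean, math.sqrt(post_var))
        if iteration >= burn_in:
            draws.append(beta)
    return draws

y = [1] * 35 + [0] * 15                               # hypothetical data: 35 successes in 50 trials
draws = gibbs_probit_intercept(y)
print("posterior mean and sd of beta:", mean(draws), stdev(draws))
print("posterior mean of P(Y = 1):", mean(STD_NORMAL.cdf(b) for b in draws))
```

With explanatory variables, step 2 would draw from the multivariate normal distribution given above, which calls for a linear algebra library; the structure of the loop stays the same.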

Albert  and  Chib  also  used  a  link  function  corresponding  to  the  cdf  of  a  t  distribution, 
to  investigate  the  sensitivity  of  fitted  probabilities  of  response  categories  to  the  choice  of 
link  function.  This  approach  provides  the  Cauchy  link  when  df  =  1  and  the  probit  link 
as df → ∞. It also provides close approximations to results for corresponding logistic
models, because a t variate with df = 8 divided by 0.634 well approximates a standard
logistic variate. They considered the t link case through a latent variable model using a
scale mixture of normal distributions. They also considered a hierarchical analysis that
specifies priors for the hyperparameters of a normal prior distribution for $\beta$.
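The scaling constant 0.634 can be sanity-checked by comparing standard deviations. This short sketch checks only the scale, not the full shape of the two distributions, and uses the standard variance formulas for the t and logistic distributions.

```python
import math

t8_sd = math.sqrt(8.0 / (8.0 - 2.0))      # sd of a t variate with df = 8
scaled_t8_sd = t8_sd / 0.634              # sd after dividing by 0.634
logistic_sd = math.pi / math.sqrt(3.0)    # sd of the standard logistic distribution
print(round(scaled_t8_sd, 3), round(logistic_sd, 3))
```

The two standard deviations agree to within about 0.01.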


7.2.7  Bayesian  Model  Checking  for  Binary  Regression 

Model  checking  methods  can  investigate  various  aspects  of  the  chosen  model.  Many  of  these 
methods  parallel  frequentist  methods  for  model  checking.  For  example,  sensitivity  analyses 
investigate  how  much  posterior  inferences  change  when  alternative  reasonable  models  are 
used.  Case  deletion  diagnostics  summarize  the  influence  of  individual  observations.  Bayes 
factors  can  be  formed  to  compare  different  link  functions  in  terms  of  posterior  odds  for  a 
pair  of  models. 

If  the  model  is  adequate,  new  data  sets  generated  from  the  model  should  look  like  the 
observed  data.  Analogs  of  test  statistics  compare  the  observed  data  to  predictive  simulations 
based on the model. Analogs of P-values find the probability that replicated data are more
extreme  than  the  observed  data. 

Details  of  model  checking  methods  are  beyond  the  scope  of  this  text.  See  Christensen 
et  al.  (2010,  Sec.  8.3),  Gelman  et  al.  (2004,  Chap.  6),  and  Spiegelhalter  et  al.  (2002)  and 
references  therein.  The  Spiegelhalter  et  al.  (2002)  article  proposed  a  complexity  measure  for 
the  effective  number  of  parameters  in  a  model.  They  also  proposed  a  deviance  information 
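criterion (DIC) for comparing models as a Bayesian analog of AIC. The DIC is based
on adding double the effective number of parameters to the deviance evaluated at the
posterior mean of the parameters, which serves for checking fit; equivalently, it adds the
effective number of parameters to the posterior mean of the deviance. For a set of
candidate models that seem to adequately explain the data, the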
model  selected  is  the  one  that  minimizes  DIC. 
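The DIC is straightforward to compute from MCMC output, using the Spiegelhalter et al. (2002) definitions $p_D = \bar{D} - D(\bar{\theta})$ and $\text{DIC} = D(\bar{\theta}) + 2p_D = \bar{D} + p_D$. In the sketch below the deviance values are made-up numbers for two hypothetical models, and the function name is an assumption.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (p_D, DIC) following Spiegelhalter et al. (2002).

    p_D = (posterior mean of the deviance) - (deviance at the posterior mean of the parameters)
    DIC = deviance at the posterior mean + 2 * p_D = posterior mean deviance + p_D
    """
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return p_d, deviance_at_posterior_mean + 2.0 * p_d

# Hypothetical deviance values at four posterior draws, for two candidate models.
p_d_1, dic_1 = dic([101.0, 103.5, 102.0, 104.5], deviance_at_posterior_mean=100.5)
p_d_2, dic_2 = dic([99.0, 104.0, 103.0, 106.0], deviance_at_posterior_mean=98.0)
print(p_d_1, dic_1, p_d_2, dic_2)
```

Here the second model fits slightly better at its posterior mean but has a larger effective number of parameters, so the first model has the smaller DIC.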


7.3  CONDITIONAL  LOGISTIC  REGRESSION 

ML  estimators  of  logistic  regression  model  parameters  perform  well  when  the  sample  size 
n  is  large  compared  with  the  number  of  parameters.  When  n  is  small  or  when  the  number 
of  parameters  grows  as  n  does,  improved  inference  results  using  conditional  maximum 
likelihood. 

The  conditional  likelihood  approach  eliminates  nuisance  parameters  by  conditioning 
on  their  sufficient  statistics.  The  conditional  likelihood  refers  to  a  distribution  defined  for 
potential  samples  that  provide  the  same  information  about  the  nuisance  parameters  that 
occurs  in  the  observed  sample.  We  next  introduce  this  approach,  using  it  later  in  the  text 
(Sections  16.5  and  16.6)  for  other  small-sample  inference  in  contingency  tables.  We’ll  also 
find  it  to  be  useful  for  the  modeling  of  matched-pairs  data  in  Section  11.2  and  in  more 
general  contexts  for  clustered  data  in  Section  13.1,  for  models  in  which  the  number  of 
parameters grows as $n$ does. In this setting it is an alternative to hierarchical Bayesian and
frequentist  random  effects  approaches  that  reduce  the  dimension  of  the  parameter  space  by 
assuming  a  probability  distribution  for  parameter  sets  that  grow  with  the  sample  size. 


7.3.1  Conditional  Likelihood 

We begin with a general exposition and then present special cases. For subject $i$, let $y_i$
denote the binary response and let $x_{ij}$ be the value of predictor $j$, $j = 1, \ldots, p$. (For now,
each $y_i$ refers to a single trial, so $n_i = 1$.) The model is

$$P(Y_i = y_i) = \frac{\exp\left[y_i\left(\alpha + \sum_{j=1}^{p} \beta_j x_{ij}\right)\right]}{1 + \exp\left(\alpha + \sum_{j=1}^{p} \beta_j x_{ij}\right)}. \tag{7.5}$$

Substituting $y_i = 1$ gives the usual expression, such as (5.16). Here, we explicitly separate
the intercept from the coefficients of the $p$ predictors. For $N$ independent observations,

$$P(Y_1 = y_1, \ldots, Y_N = y_N) = \frac{\exp\left[\left(\sum_i y_i\right)\alpha + \sum_{j=1}^{p}\left(\sum_i y_i x_{ij}\right)\beta_j\right]}{\prod_i \left[1 + \exp\left(\alpha + \sum_{j=1}^{p} \beta_j x_{ij}\right)\right]}. \tag{7.6}$$

The sufficient statistic for $\alpha$ is $\sum_i y_i$, the total number of successes. The sufficient statistic
for $\beta_j$ is $\sum_i y_i x_{ij}$, $j = 1, \ldots, p$.

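Usually, some parameters refer to effects of primary interest. Others may be there
to adjust for relevant effects, but their values are not of special interest. We can eliminate
the latter parameters from the likelihood by conditioning on their sufficient statistics.
We illustrate by eliminating $\alpha$. (In Section 11.2.5 we will show that for models
for matched case-control studies, there is a large number of intercept terms and they
cause difficulties with inference about the primary parameters, so it is helpful to eliminate
them.) The sufficient statistic for $\alpha$ is $\sum_i y_i$, so we condition on $\sum_i y_i$. Suppose that
$\sum_i y_i = t$. Denote the conditional reference set of samples having the same value of $\sum_i y_i$ as
observed by

$$S(t) = \left\{(y_1^*, \ldots, y_N^*) : \sum_i y_i^* = t\right\}.$$

With $\{y_i\}$ such that $\sum_i y_i = t$, the conditional likelihood function equals

$$
\begin{aligned}
P\Bigl(Y_1 = y_1, \ldots, Y_N = y_N \,\Big|\, \sum_i Y_i = t\Bigr)
&= \frac{P(Y_1 = y_1, \ldots, Y_N = y_N)}{\sum_{S(t)} P(Y_1 = y_1^*, \ldots, Y_N = y_N^*)} \\
&= \frac{\exp\left[t\alpha + \sum_{j=1}^{p}\left(\sum_i y_i x_{ij}\right)\beta_j\right] \Big/ \prod_i \left[1 + \exp\left(\alpha + \sum_{j=1}^{p}\beta_j x_{ij}\right)\right]}{\sum_{S(t)} \exp\left[t\alpha + \sum_{j=1}^{p}\left(\sum_i y_i^* x_{ij}\right)\beta_j\right] \Big/ \prod_i \left[1 + \exp\left(\alpha + \sum_{j=1}^{p}\beta_j x_{ij}\right)\right]} \\
&= \frac{\exp\left[\sum_{j=1}^{p}\left(\sum_i y_i x_{ij}\right)\beta_j\right]}{\sum_{S(t)} \exp\left[\sum_{j=1}^{p}\left(\sum_i y_i^* x_{ij}\right)\beta_j\right]}.
\end{aligned}
$$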


This  does  not  depend  on  a . 

Once  we’ve  obtained  the  conditional  likelihood,  we  use  it  like  an  ordinary  likelihood. 
For  the  parameters  in  it,  their  conditional  ML  estimates  are  the  values  maximizing  it.  Found 
using  iterative  methods,  the  estimators  are  asymptotically  normal  with  covariance  matrix 
equal  to  the  negative  inverse  of  the  matrix  of  second  partial  derivatives  of  the  conditional 
log  likelihood.  Likewise,  we  can  construct  large-sample  Wald,  likelihood-ratio  and  score 
tests  using  approximate  chi-squared  sampling  distributions,  and  we  can  invert  such  tests  to 
construct  confidence  intervals. 
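
The cancellation of the intercept can be checked numerically on a tiny data set. The following is a minimal sketch, not a practical algorithm: it enumerates the reference set of all binary vectors with the observed number of successes, which is feasible only for very small N. The data values and the function names (`joint_probability`, `reference_set`, `conditional_via_joint`, `conditional_likelihood`) are hypothetical and chosen for illustration.

```python
from itertools import combinations
from math import exp


def joint_probability(y, X, alpha, beta):
    """Unconditional probability (7.6) of the binary vector y."""
    prob = 1.0
    for y_i, row in zip(y, X):
        eta = alpha + sum(b * x for b, x in zip(beta, row))
        prob *= exp(y_i * eta) / (1.0 + exp(eta))
    return prob


def reference_set(n, t):
    """All binary vectors of length n with exactly t successes, i.e. S(t)."""
    for ones in combinations(range(n), t):
        yield [1 if i in ones else 0 for i in range(n)]


def conditional_via_joint(y, X, alpha, beta):
    """P(Y = y | sum(y) = t), computed as a ratio of joint probabilities."""
    denom = sum(joint_probability(y_star, X, alpha, beta)
                for y_star in reference_set(len(y), sum(y)))
    return joint_probability(y, X, alpha, beta) / denom


def conditional_likelihood(y, X, beta):
    """Final line of the derivation: the same quantity with alpha cancelled."""
    def kernel(y_vec):
        return exp(sum(b * sum(y_i * row[j] for y_i, row in zip(y_vec, X))
                       for j, b in enumerate(beta)))
    denom = sum(kernel(y_star) for y_star in reference_set(len(y), sum(y)))
    return kernel(y) / denom


# Hypothetical data: five subjects, one predictor.
y = [1, 0, 1, 0, 0]
X = [[0.5], [1.0], [2.0], [-1.0], [0.0]]
beta = [0.7]

# The three printed values agree, whatever value of alpha is used.
print(conditional_via_joint(y, X, alpha=-2.0, beta=beta))
print(conditional_via_joint(y, X, alpha=3.0, beta=beta))
print(conditional_likelihood(y, X, beta))
```

Changing `alpha` leaves the conditional probability unchanged, which is what makes the conditional likelihood usable for inference about the remaining parameters.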


7.3.2 Small-Sample Inference for a Logistic Regression Parameter

As  an  alternative  to  large-sample  methods,  we  can  use  the  conditional  distribution  to  perform 
“exact”  inference,  using  permutation  methods  that  consider  the  set  of  all  data  arrays  that 
have  the  fixed  values  for  the  sufficient  statistics  we  condition  upon.  With  it,  probabilities 
such  as  P-values  occur  exactly  rather  than  as  crude  approximations  (Cox  1970). 

For instance, suppose that inference focuses on $\beta_p$ in model (7.5). To eliminate
other parameters, we condition on their sufficient statistics $T_j = \sum_i y_i x_{ij}$, $j = 0, \ldots, p - 1$
(where $x_{i0} = 1$). With an argument like that just shown, we obtain the conditional
distribution

$$P(Y_1 = y_1, \ldots, Y_N = y_N \mid T_j = t_j,\ j = 0, \ldots, p - 1)$$

$$= \frac{\exp\left[\left(\sum_i y_i x_{ip}\right)\beta_p\right]}{\sum_{S(t_0, \ldots, t_{p-1})} \exp\left[\left(\sum_i y_i^* x_{ip}\right)\beta_p\right]} = \frac{\exp(t_p \beta_p)}{\sum_{S(t_0, \ldots, t_{p-1})} \exp(t_p^* \beta_p)},$$

where

$$S(t_0, \ldots, t_{p-1}) = \left\{(y_1^*, \ldots, y_N^*): \sum_i y_i^* x_{ij} = t_j,\ j = 0, \ldots, p - 1\right\}.$$

This depends only on $\beta_p$. Inference for $\beta_p$ uses the conditional distribution of its sufficient
statistic, $T_p = \sum_i y_i x_{ip}$, given the others. Let $c(t_0, \ldots, t_{p-1}, t)$ denote the number of data
vectors in $S(t_0, \ldots, t_{p-1})$ for which $T_p = t$. The conditional distribution of $T_p$ is

$$P(T_p = t \mid T_j = t_j,\ j = 0, \ldots, p - 1) = \frac{c(t_0, \ldots, t_{p-1}, t)\exp(t\beta_p)}{\sum_u c(t_0, \ldots, t_{p-1}, u)\exp(u\beta_p)}, \tag{7.7}$$

where the denominator summation refers to the possible values $u$ of $T_p$.

For testing $H_0$: $\beta_p = 0$, the conditional distribution simplifies. For $H_a$: $\beta_p > 0$ and
observed $T_p = t_{obs}$, the exact conditional P-value is

$$\sum_{t \ge t_{obs}} P(T_p = t \mid T_j = t_j,\ j = 0, \ldots, p - 1) = \frac{\sum_{t \ge t_{obs}} c(t_0, \ldots, t_{p-1}, t)}{\sum_u c(t_0, \ldots, t_{p-1}, u)}.$$

This is the proportion of data configurations in the conditional set for which the sufficient
statistic for $\beta_p$ is at least as large as observed. Implementing this inference requires
calculating $\{c(t_0, \ldots, t_{p-1}, u)\}$. For all but the simplest problems, computations are intensive
and require specialized software (e.g., LogXact of Cytel Software or PROC LOGISTIC in
SAS).
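
For intuition only, the counts c(t_0, ..., t_{p-1}, u) can be obtained by brute-force enumeration when N is very small. The sketch below is an illustration under stated assumptions, not how specialized software works: it loops over all 2^N binary response vectors, keeps those matching the observed sufficient statistics of the nuisance parameters, and tabulates the statistic for the parameter of interest. The data, the integer-valued predictors (used so that equality of sufficient statistics is exact), and the function names are hypothetical.

```python
from itertools import product


def sufficient_stats(y_vec, columns):
    """Tuple of sum_i y_i * x_ij, one entry per conditioning column."""
    return tuple(sum(y_i * x_i for y_i, x_i in zip(y_vec, col)) for col in columns)


def exact_conditional_counts(y, nuisance_columns, x_interest):
    """Return (t_obs, counts), where counts[u] is the number of response
    vectors in the conditional reference set that have T_p = u."""
    n = len(y)
    # The column of ones (x_i0 = 1) is the conditioning column for the intercept.
    columns = [[1] * n] + [list(col) for col in nuisance_columns]
    target = sufficient_stats(y, columns)
    counts = {}
    for y_star in product((0, 1), repeat=n):
        if sufficient_stats(y_star, columns) == target:
            u = sum(y_i * x_i for y_i, x_i in zip(y_star, x_interest))
            counts[u] = counts.get(u, 0) + 1
    t_obs = sum(y_i * x_i for y_i, x_i in zip(y, x_interest))
    return t_obs, counts


def exact_p_value_greater(y, nuisance_columns, x_interest):
    """Exact conditional P-value for H_a: beta_p > 0 under H_0: beta_p = 0."""
    t_obs, counts = exact_conditional_counts(y, nuisance_columns, x_interest)
    return sum(c for u, c in counts.items() if u >= t_obs) / sum(counts.values())


# Hypothetical data: eight subjects, an integer dose score, and a binary covariate.
y = [0, 0, 0, 1, 0, 1, 1, 1]
dose = [0, 1, 2, 3, 4, 5, 6, 7]
group = [0, 0, 0, 0, 1, 1, 1, 1]

print(exact_p_value_greater(y, [], dose))       # conditions on the intercept only
print(exact_p_value_greater(y, [group], dose))  # also conditions on the group effect
```

Conditioning on an additional sufficient statistic shrinks the reference set (here from 70 vectors to 16), which illustrates why the conditional distribution becomes coarser as more nuisance parameters are eliminated.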


7.3.3  Small-Sample  Conditional  Inference  for  2  x  2  Contingency  Tables 

We illustrate first with a simple special case. Consider logistic regression with a single
binary predictor $x$,

$$\text{logit}[P(Y_i = 1)] = \alpha + \beta x_i, \quad i = 1, \ldots, N, \tag{7.8}$$

where $x_i = 1$ denotes row 1 and $x_i = 0$ denotes row 2. The model applies to a 2 x 2
table. The sufficient statistic $\sum_i y_i$ for $\alpha$ is the first column total. The sufficient statistic
$T = \sum_i x_i y_i$ for $\beta$ simplifies to the number of successes in the first row. Equivalently, the
sufficient statistics for the model are the numbers of successes in the two rows. Let $t$ and $s$
denote these binomial variates. The row totals $n_1$ and $n_2$ are their indices.

To eliminate $\alpha$, we condition on $\sum_i y_i = t + s$, the first column total. Since $N = n_1 + n_2$
is fixed, so then is the other column marginal total. Fisher (1935c) showed that fixing both
sets of marginal totals yields noncentral hypergeometric probabilities for $t$ that depend only
on $\beta$,

$$f(t \mid t + s; n_1, n_2, \beta) = \frac{\binom{n_1}{t}\binom{n_2}{s} e^{\beta t}}{\sum_{u=m_-}^{m_+} \binom{n_1}{u}\binom{n_2}{s + t - u} e^{\beta u}} \tag{7.9}$$

for $m_- \le t \le m_+$ with $m_- = \max(0, n_{1+} + n_{+1} - n)$ and $m_+ = \min(n_{1+}, n_{+1})$, where in
the present notation $n_{1+} = n_1$, $n_{+1} = t + s$, and $n = N$. In that
case the conditional distribution satisfies (7.7) with $c(t_0, t) = \binom{n_1}{t}\binom{n_2}{s}$ and with $t_0 = t + s$.
The resulting exact conditional test that $\beta = 0$ is Fisher's exact test for 2 x 2 tables
(Section 3.5.1).
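
The noncentral hypergeometric probabilities in (7.9) are simple to evaluate directly. The following minimal sketch computes the mass function for any value of the log odds ratio and, at zero, the one-sided Fisher exact P-value. The function names and the example table (row totals 4 and 4, first column total 4, three successes in row 1) are hypothetical.

```python
from math import comb, exp


def noncentral_hypergeom_pmf(t, n1, n2, col_total, beta):
    """Probability (7.9) that row 1 has t successes, given both margins."""
    lo = max(0, col_total - n2)   # m_minus
    hi = min(n1, col_total)       # m_plus
    if not lo <= t <= hi:
        return 0.0
    denom = sum(comb(n1, u) * comb(n2, col_total - u) * exp(beta * u)
                for u in range(lo, hi + 1))
    return comb(n1, t) * comb(n2, col_total - t) * exp(beta * t) / denom


def fisher_exact_greater(t_obs, n1, n2, col_total):
    """One-sided P-value for H_a: beta > 0, using (7.9) with beta = 0."""
    hi = min(n1, col_total)
    return sum(noncentral_hypergeom_pmf(t, n1, n2, col_total, 0.0)
               for t in range(t_obs, hi + 1))


# Hypothetical 2 x 2 table: rows of size 4 and 4, four successes in total,
# three of them in row 1.
print(fisher_exact_greater(3, 4, 4, 4))
```

Evaluating the same mass function over a grid of values of the log odds ratio gives the conditional likelihood for that parameter, which is the basis for conditional ML estimation and for inverting the test to obtain a confidence interval.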


7.3.4  Small-Sample  Conditional  Inference  for  Linear  Logit  Model 

The linear logit model, $\text{logit}(\pi_i) = \alpha + \beta x_i$, applies to $I \times 2$ tables with ordered rows
(Section 5.3.4). For it, the data $\{y_i\}$ are $I$ independent $\{\text{bin}(n_i, \pi_i)\}$ counts, with fixed row
totals $\{n_i\}$. Conditioning on $\sum_i y_i$ and hence the column totals yields a conditional likelihood
free of $\alpha$ (Cox 1958). Exact inference about $\beta$ uses its sufficient statistic, $T = \sum_i x_i y_i$.
From (7.7) its distribution has the form

$$P\left(T = t \,\Big|\, \sum_i y_i = t_0; \beta\right) = \frac{c(t_0, t)e^{\beta t}}{\sum_u c(t_0, u)e^{\beta u}}. \tag{7.10}$$

Here, $c(t_0, u)$ equals the sum of $\prod_i \binom{n_i}{y_i}$ for all tables with the given marginal totals that
have $T = u$.

When $\beta = 0$, the cell counts have distribution that is a special case of a multivariate
hypergeometric distribution, to be shown in (16.27). To test $H_0$: $\beta = 0$, ordering the tables
with the given margins by $T$ is equivalent to ordering them by the Cochran-Armitage
statistic. Thus, this test for the linear logit model is an exact trend test.

In Section 5.3.6 we applied the Cochran-Armitage test to Table 5.3 on maternal alcohol
consumption and infant malformation. Even though $n = 32{,}574$, the table is highly unbalanced,
with both very small and very large counts. It is safer to use small-sample methods.
For the exact conditional trend test with the same row scores as used there, the one-sided
P-value for $H_a$: $\beta > 0$ is 0.0168. The two-sided P-value is 0.0172, reflecting asymmetry of
the conditional distribution, given the marginal counts. We obtained a two-sided P-value
of 0.010 with the large-sample Cochran-Armitage test.
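
For a small table, the null distribution in (7.10) can be built by direct enumeration. The sketch below is illustrative only and would be far too slow for a table such as Table 5.3: it lists every vector of row success counts with the observed total, weights each by its product of binomial coefficients, and accumulates the weights by the value of T. The function names, the integer row scores, and the example counts are hypothetical.

```python
from itertools import product
from math import comb


def trend_null_distribution(row_totals, scores, n_success):
    """Null distribution of T = sum_i x_i y_i given the margins (beta = 0)."""
    weights = {}
    for ys in product(*(range(n + 1) for n in row_totals)):
        if sum(ys) != n_success:
            continue
        # Contribution to c(t_0, u): the product over rows of C(n_i, y_i).
        w = 1
        for n, y in zip(row_totals, ys):
            w *= comb(n, y)
        u = sum(x * y for x, y in zip(scores, ys))
        weights[u] = weights.get(u, 0) + w
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


def exact_trend_p_value(row_totals, scores, ys_obs):
    """One-sided exact conditional P-value for H_a: beta > 0."""
    dist = trend_null_distribution(row_totals, scores, sum(ys_obs))
    t_obs = sum(x * y for x, y in zip(scores, ys_obs))
    return sum(p for u, p in dist.items() if u >= t_obs)


# Hypothetical 3 x 2 table: four trials per row, scores 0, 1, 2,
# and observed success counts 0, 1, 3.
print(exact_trend_p_value((4, 4, 4), (0, 1, 2), (0, 1, 3)))
```

Integer scores are used here so that values of T can serve as exact dictionary keys; with fractional scores, values of T that should be equal may differ by rounding error and would need to be grouped with a tolerance.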


7.3.5 Small-Sample Tests of Conditional Independence in 2 x 2 x K Tables

For $2 \times 2 \times K$ tables $\{n_{ijk}\}$, the Cochran-Mantel-Haenszel test of XY conditional
independence uses a standardization of $\sum_k n_{11k}$. For the logistic model

$$\text{logit}(\pi_{ik}) = \alpha + \beta x_i + \beta_k^Z, \quad i = 1, 2, \quad k = 1, \ldots, K, \tag{7.11}$$

this is the sufficient statistic for the effect of X. To conduct a small-sample test of $\beta = 0$,
we need to eliminate the other model parameters. Constructing the likelihood reveals that
the sufficient statistics for $\{\beta_k^Z\}$ are the column marginal totals $\{n_{+jk}\}$ in each partial table.
When X and Z are predictors, it is natural to treat the numbers of trials $\{n_{i+k}\}$ at each
combination of XZ values as fixed. Thus, small-sample inference about $\beta$ conditions on the
row and column totals in each stratum.

Conditional on the strata margins, an exact test uses $T = \sum_k n_{11k}$. Hypergeometric
probabilities occur in each partial table for the independent null distributions of $\{n_{11k},
k = 1, \ldots, K\}$. The product of the K mass functions gives the null joint distribution³ of
$\{n_{11k}\}$. This determines the null distribution of $T$. For $H_a$: $\beta > 0$, the P-value is the null
probability that $T \ge t_{obs}$, for the fixed strata marginal totals. Mehta et al. (1985) presented
a fast algorithm. The test simplifies to Fisher's exact test when $K = 1$.

7.3.6  Example:  Promotion  Discrimination 

Table  7.5  refers  to  U.S.  government  computer  specialists  of  similar  seniority  considered  for 
promotion.  The  table  cross-classifies  promotion  decision  by  employee’s  race,  considered 
for  three  separate  months.  We  test  conditional  independence  of  promotion  decision  and  race, 
or $H_0$: $\beta = 0$ in model (7.11). The table contains several small counts. The overall sample
size is not small ($n = 74$), but one marginal count (collapsing over month of decision)
equals zero, so we might be wary of using the CMH test.

The alternative $H_a$: $\beta < 0$ corresponds to a lower probability of promotion for black
employees than for white employees. Given the margins of the partial tables in Table 7.5,
$n_{111}$ can range between 0 and 4, $n_{112}$ can range between 0 and 4, and $n_{113}$ can range
between 0 and 2. The total $T = \sum_k n_{11k}$ can range between 0 and 10. The sample data are
the most extreme possible result in each case. The observed $\sum_k n_{11k} = 0$, and the P-value
is the null probability of this outcome. Software provides $P = 0.026$. A two-sided P-value,
based on summing the probabilities of all tables no more likely than the observed table,
equals 0.056.
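
Because each stratum here has only a handful of possible tables, the one-sided P-value can be reproduced without a specialized algorithm. The sketch below takes the counts from Table 7.5, computes the hypergeometric null distribution of the black-promoted count in each month, and convolves the three distributions to get the null distribution of T. The dictionary layout and the helper names are illustrative choices, not part of any particular software package.

```python
from math import comb

# Counts from Table 7.5: (black yes, black no, white yes, white no) for each month.
strata = {
    "July": (0, 7, 4, 16),
    "August": (0, 7, 4, 13),
    "September": (0, 8, 2, 13),
}


def hypergeom_null(n11, n12, n21, n22):
    """Null distribution of the (1,1) count, given the row and column totals."""
    r1, r2, c1 = n11 + n12, n21 + n22, n11 + n21
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = comb(r1 + r2, c1)
    return {u: comb(r1, u) * comb(r2, c1 - u) / total for u in range(lo, hi + 1)}


def convolve(d1, d2):
    """Distribution of the sum of two independent integer-valued variables."""
    out = {}
    for a, pa in d1.items():
        for b, pb in d2.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


dist_T = {0: 1.0}
for cells in strata.values():
    dist_T = convolve(dist_T, hypergeom_null(*cells))

t_obs = sum(cells[0] for cells in strata.values())
# For H_a: beta < 0 the P-value sums the lower tail, T <= t_obs.
p_lower = sum(p for t, p in dist_T.items() if t <= t_obs)
print(round(p_lower, 3))  # 0.026
```

Since the observed total is the smallest possible value, the P-value is just the product of the three probabilities of a zero count, roughly 0.276, 0.224, and 0.415.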


Table 7.5 Promotion Decisions by Race and by Month

             July           August         September
             Promotions     Promotions     Promotions
Race         Yes    No      Yes    No      Yes    No
Black          0     7        0     7        0     8
White          4    16        4    13        2    13

Source: J. Gastwirth, Statistical Reasoning in Law and Public
Policy. San Diego, CA: Academic Press, 1988, p. 266.


³ This is (16.29) in Chapter 16, setting $\theta = 1$.


7.3.7 Discreteness Complications of Using Exact Conditional Inference

Like  Fisher’s  exact  test,  exact  conditional  inference  for  logistic  regression  is  conservative 
because  of  discreteness.  This  is  especially  true  when  n  is  small  or  the  data  are  unbalanced, 
with  most  observations  falling  in  a  single  column  or  row.  Using  mid  P-values  in  tests  and 
related  confidence  intervals  reduces  conservativeness. 
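
The effect of the mid P-value is easy to see in the 2 x 2 case. The sketch below, using a hypothetical table and hypothetical function names, computes the ordinary one-sided exact P-value and the mid P-value, which counts only half the null probability of the observed outcome.

```python
from math import comb


def hypergeom_pmf(t, n1, n2, col_total):
    """Null probability of t successes in row 1, given both margins."""
    return comb(n1, t) * comb(n2, col_total - t) / comb(n1 + n2, col_total)


def exact_and_mid_p(t_obs, n1, n2, col_total):
    """Return (exact P-value, mid P-value) for the one-sided alternative
    that favors large counts in row 1."""
    hi = min(n1, col_total)
    p_equal = hypergeom_pmf(t_obs, n1, n2, col_total)
    p_more_extreme = sum(hypergeom_pmf(t, n1, n2, col_total)
                         for t in range(t_obs + 1, hi + 1))
    return p_more_extreme + p_equal, p_more_extreme + 0.5 * p_equal


# Hypothetical table: row totals 4 and 4, four successes overall, three in row 1.
exact_p, mid_p = exact_and_mid_p(3, 4, 4, 4)
print(exact_p, mid_p)
```

With so few possible tables, the observed outcome carries a large share of the null probability, so the two P-values differ substantially. The mid P-value is no longer guaranteed to keep the actual error rate at or below the nominal level.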

A particular difficulty occurs when no other set of $\{y_i^*\}$ values has the same value as
the observed data for a sufficient statistic $\sum_i y_i x_{ij}$ on which we condition. In that case the
conditional distribution of the sufficient statistic is degenerate. The P-value for the exact test
then equals 1.0. This commonly happens when at least one explanatory variable $x_j$ whose
effect is conditioned out for the inference is continuous, with unequally spaced observed
values.

Finally,  a  limitation  of  the  conditional  approach  is  requiring  sufficient  statistics  for 
the  nuisance  parameters.  Reduced  sufficient  statistics  exist  only  with  GLMs  that  use  the 
canonical  link.  Thus,  for  instance,  the  conditional  approach  works  for  logistic  models  but 
not  probit  models. 


7.4 SMOOTHING: KERNELS, PENALIZED LIKELIHOOD, GENERALIZED ADDITIVE MODELS

So far in this text we've performed rather severe smoothings of categorical data, by
producing fitted values satisfying a particular model. In Sections 1.6 and 3.6 we found that
Bayesian  methods  can  perform  a  weaker  type  of  smoothing  than  this,  for  example,  by 
shrinking  cell  proportions  in  a  contingency  table  toward  a  simple  model  without  explicitly 
assuming  that  model.  In  Section  7.2,  we  employed  Bayesian  fitting  of  binary  regression 
models,  essentially  smoothing  the  ML  fit  in  the  direction  of  a  prior  distribution.  This  section 
presents  frequentist  ways  of  smoothing  categorical  data,  mainly  in  the  context  of  analyzing 
a  binary  response  variable. 


7.4.1  How  Much  Smoothing?  The  Variance/Bias  Trade-off 

Smoothing  methods,  in  a  sense,  have  more  of  a  nonparametric  fashion,  as  they  base  analyses 
on  a  more  general  structure.  There  is  then  less  potential  for  incorrect  conclusions  because 
of  model  misspecification.  However,  in  some  ways  the  demands  are  greater:  We  need  to 
choose  among  a  potentially  infinite  number  of  forms  relating  the  response  variable  to  the 
explanatory  variables,  the  number  of  parameters  is  also  then  potentially  much  larger,  and 
overfitting  is  a  danger. 

As  we  explained  in  Section  3.3.8,  the  comparison  between  completely  model-based 
and  other  methods  is  at  the  heart  of  the  fundamental  statistical  trade-off  between  variance 
and bias. Using a particular model has the disadvantage of increasing the potential bias
(e.g., a true probability differing from the value corresponding to fitting the model to the
population); but it has the advantage that the parsimonious decrease in the parameter space
has  the  effect  of  decreasing  the  variance  in  estimating  characteristics  of  interest. 

The  methods  presented  in  this  section  provide  a  compromise,  typically  starting  with 
a  model  but  smoothing  results  in  some  way  to  adjust  for  ways  the  model  may  fail.  All 
smoothing methods require input from the methodologist to control the degree of smoothness
imposed on the data in order to deal with the bias/variance trade-off, whether it be
determined by a smoothing parameter in a frequentist approach or a prior distribution in a
Bayesian  approach. 

7.4.2  Kernel  Smoothing 

Kernel  estimation  is  a  smoothing  method  that  in  its  basic  form  is  completely  non-model- 
based.  It  is  useful  for  any  type  of  data,  providing  a  sort  of  nonparametric  way  to  estimate 
a  probability  density  or  mass  function  without  assuming  a  parametric  distribution.  Like 
Bayesian  methods,  to  estimate  a  mean  (such  as  a  cell  probability)  at  a  particular  point,  it 
smooths  the  data  by  using  not  only  the  data  at  that  point  (such  as  a  sample  proportion)  but 
also  the  data  at  other  points. 

First, consider estimating joint cell probabilities $\pi$ in a multiway contingency table
by smoothing the sample cell proportions $p$. Let $K$ denote a square matrix containing
nonnegative elements. Kernel estimates of $\pi$ have the simple form

$$\tilde{\pi} = Kp. \tag{7.12}$$

The column totals of $K$ are taken to be 1, which forces the sum of elements in $\tilde{\pi}$ to be 1,
like $p$. Such kernels are usually constructed to yield probability estimates of form

$$\tilde{\pi}_i = (1 - \lambda)p_i + \lambda(\text{smoother}_i),$$

where $\lambda$ is a constant that controls the degree of smoothing. Greater $\lambda$ provides more
smoothing. The structure used for the smoother, as imbedded in $K$ in expression (7.12),
incorporates the other observations, its form depending on whether variables are binary,
nominal, or ordinal. For ordinal data, for example, the smoothing gives more weight to
nearby cells and works well when true probabilities in nearby cells are similar.

This method can also smooth binary response data in a regression context, for example,
for constructing a graph to portray the form of dependence of $y$ on a predictor. Copas
(1983) presented a simple method of this sort for a single quantitative explanatory variable
$x$. Let $\phi(\cdot)$ denote a symmetric unimodal kernel function. This is usually taken to be a
bell-shaped pdf, such as the standard normal. At any value $x$, the kernel smoothed estimate
of $P(Y = 1 \mid X = x)$ is

$$\tilde{\pi}(x) = \frac{\sum_i y_i \phi[(x - x_i)/\lambda]}{\sum_i \phi[(x - x_i)/\lambda]}, \tag{7.13}$$

where $\lambda > 0$ is a smoothing parameter. At any point $x$, the estimate $\tilde{\pi}(x)$ is a weighted
average of the $\{y_i\}$. For the simple function $\phi(u) = 1$ when $u = 0$ and $\phi(u) = 0$ otherwise,
$\tilde{\pi}(x_k)$ simplifies to the sample proportion of successes at $x = x_k$. Then, there is no
smoothing. When $\phi$ is proportional to the standard normal pdf, $\phi(u) = \exp[-u^2/2]$, we get
behavior approaching this by letting $\lambda \to 0$.

For very small $\lambda$, only points quite close to $x$ have much influence. Then, using mainly
very local data, there is little bias but high variance. By contrast, as $\lambda$ increases, data points
farther from $x$ also can have a significant contribution to $\tilde{\pi}(x)$. As $\lambda$ increases and more
weight is given to points greatly distant, the smoother is more like the overall sample
proportion, being more highly biased but with smaller variance. As $\lambda$ grows unboundedly,
the smooth function $\tilde{\pi}(x)$ converges to a horizontal line at the level of the overall sample
proportion $p$ of successes (Exercise 7.33).

For this kernel smoother, the choice of $\lambda$ is more important to determining $\tilde{\pi}(x)$ than
is the choice of $\phi$. Copas recommended selecting $\lambda$ by plotting the resulting function for
several values of $\lambda$, varying around a value equal to 10 times the average spacing of the $x$
values.
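
The smoother (7.13) takes only a few lines to implement. The following minimal sketch uses the normal-type kernel exp(-u^2/2) and a small hypothetical data set; the function name `kernel_smooth` and the data are illustrative. It evaluates the estimate only at observed x values, so that at least one weight equals 1 and the denominator cannot underflow to zero when the smoothing parameter is tiny.

```python
from math import exp


def kernel_smooth(x0, xs, ys, lam):
    """Kernel estimate (7.13) of P(Y = 1 | X = x0) with phi(u) = exp(-u^2 / 2)."""
    weights = [exp(-0.5 * ((x0 - x_i) / lam) ** 2) for x_i in xs]
    return sum(w * y_i for w, y_i in zip(weights, ys)) / sum(weights)


# Hypothetical binary responses at ten x values (two observations at x = 1).
xs = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

for lam in (0.01, 1.0, 3.0, 1e6):
    curve = [round(kernel_smooth(x0, xs, ys, lam), 3) for x0 in sorted(set(xs))]
    print(lam, curve)
```

With the smallest value, each estimate is the sample proportion at that x value (0.5 at x = 1, where one of the two responses is a success). With the largest value, every estimate is essentially the overall sample proportion, 0.6. Intermediate values trace out the bias/variance compromise described above.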

7.4.3  Example:  Smoothing  to  Portray  Probability  of  Kyphosis 

Hastie and Tibshirani (1990, p. 282) described a study to determine risk factors for kyphosis,
which  is  severe  forward  flexion  of  the  spine  following  corrective  spinal  surgery.  Figure  7.4 
shows  this  binary  outcome  y  (1  =  kyphosis  present,  0  =  absent)  plotted  against  the  age 
in  months  at  the  time  of  the  operation.  At  the  very  low  and  very  high  levels  of  age,  most 
observations  have  kyphosis  absent. 

Figure 7.4 also shows the result of kernel smoothing of the data using the smoother
(7.13), with $\lambda = 25$, 100, and 200. The value $\lambda = 25$ is too low, and the figure is more
irregular than the data justify. The higher values of $\lambda$ give evidence of nonmonotonicity in
the relationship. In fact, adding a quadratic term to the standard logistic regression model
provides an improved fit (Exercise 5.8).

7.4.4  Nearest  Neighbors  Smoothing 

In  more  general  contexts  than  binary  regression,  smoothers  of  the  kernel  type  can  base 
estimation  at  a  point  by  using  nearby  points.  A  very  simple  such  method  is  nearest  neighbors 
smoothing.  It  is  often  used  for  classification,  such  as  by  predicting  an  observation  for  a 
subject  based  on  a  weighted  average  of  observations  for  k  subjects  who  have  similar  values 
on  the  explanatory  variables. 


Figure 7.4 Kernel smoothing estimate of probability of kyphosis as function of age, using smoothing parameter
$\lambda = 25$ (solid curve), 100 (dashed curve), 200 (dotted curve).

Let $s_{ij}$ be a measure of the similarity between subjects $i$ and $j$, such as the Euclidean
distance or Mahalanobis distance between values $x_i$ and $x_j$ of explanatory variables for the
two subjects, using standardized variables. For subject $i$, let $N(i)$ be the set of $k$ subjects
who are the nearest neighbors, having the $k$ smallest values for $s_{ij}$ among $j = 1, \ldots, n$.
Then, for a binary response, the probability $\pi_i = P(Y_i = 1)$ is estimated by

$$\frac{\sum_{j \in N(i)} s_{ij} y_j}{\sum_{j \in N(i)} s_{ij}}.$$

Greater smoothing is produced by letting $k$ be larger.

Sometimes  the  number  of  neighbors  k  to  be  used  is  fixed.  Alternatively,  cross-validation 
methods  can  be  used  to  determine  k.  For  each  value  of  k  =  1,2,...,  we  could  predict 
each  observation  using  k  neighbors,  and  then  select  the  value  of  k  for  which  the  overall 
misclassification  rate  is  smallest. 
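
The cross-validation recipe just described can be sketched as follows. This is a simplified illustration under stated assumptions: neighbors are found by Euclidean distance, each of the k neighbors receives equal weight (a simplification of the weighted average in the text), a subject is never its own neighbor, and a subject is classified as a success when its estimate exceeds 0.5. The data and function names are hypothetical.

```python
from math import dist


def knn_estimate(i, X, y, k):
    """Proportion of successes among the k nearest neighbors of subject i."""
    others = sorted((dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    neighbors = [j for _, j in others[:k]]
    return sum(y[j] for j in neighbors) / k


def loo_misclassification(X, y, k):
    """Leave-one-out misclassification rate when predicting with k neighbors."""
    errors = 0
    for i in range(len(X)):
        predicted = 1 if knn_estimate(i, X, y, k) > 0.5 else 0
        errors += predicted != y[i]
    return errors / len(X)


def choose_k(X, y, candidates):
    """Candidate k with the smallest leave-one-out misclassification rate."""
    return min(candidates, key=lambda k: loo_misclassification(X, y, k))


# Hypothetical data: one explanatory variable, ten subjects.
X = [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,), (3.0,), (3.5,), (4.0,), (4.5,), (5.0,)]
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

for k in (1, 3, 5, 7):
    print(k, loo_misclassification(X, y, k))
print("selected k:", choose_k(X, y, (1, 3, 5, 7)))
```

Odd candidate values of k avoid ties at 0.5 in the classification rule. With several explanatory variables, the columns of `X` would first be standardized, as the text recommends.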

An  advantage  of  this  method  is  its  simplicity,  once  we  select  a  similarity  measure  to 
determine  the  nearest  neighbors.  However,  the  choice  of  this  measure  may  not  be  obvious, 
especially  when  the  number  p  of  explanatory  variables  is  large  with  possibly  some  subsets 
of  them  being  highly  correlated  and  some  of  them  being  qualitative.  Also,  the  decision 
boundary  between  the  x  values  for  classifying  subjects  into  one  category  and  the  x  values 
for  classifying  subjects  into  another  category  can  be  highly  irregular,  especially  when  k  is 
small.  By  contrast,  the  decision  boundary  is  quite  simple  for  standard  binary  regression 
models  and  some  other  methods,  such  as  the  linear  discriminant  method  described  in 
Section  15.1. 

More complex smoothers further generalize this idea, for example, by basing the
prediction at a point on a weighted regression using nearby points, such as described later in
Section  7.4.9.  Such  methods  have  better  statistical  properties,  such  as  usually  lower  bias 
than  kernel  smoothing.  However,  simple  kernel  smoothing  is  usually  adequate  for  providing 
a  sense  of  the  main  features  of  the  true  relationship. 

7.4.5  Smoothing  Using  Penalized  Likelihood  Estimation 

Kernel smoothing does not assume a probability distribution for $Y$ or account for any
dependence among the observations. Other methods do this, for example, by adjusting a
likelihood function in such a way to induce smoothing. Consider an arbitrary model with
generic parameter $\beta$ and log-likelihood function $L(\beta)$. The penalized likelihood estimator
of $\beta$ maximizes

$$L^*(\beta) = L(\beta) - \lambda(\beta),$$

where $\lambda(\cdot)$ is a function that provides a roughness penalty. That is, $\lambda(\cdot)$ is such that $\lambda(\beta)$
decreases as elements of $\beta$ are smoother in some sense, such as uniformly closer to 0.

First, consider smoothing a sparse contingency table, in which $\beta$ are multinomial cell
probabilities $\pi$. For two-way tables Simonoff (1983) suggested using a penalized likelihood
approach with penalty

$$\lambda(\pi) = \lambda \sum_i \sum_j \left[\log\frac{\pi_{ij}\,\pi_{i+1,j+1}}{\pi_{i+1,j}\,\pi_{i,j+1}}\right]^2$$

for the local odds ratios. This seems especially sensible with ordinal variables. It provides
shrinkage toward the independence estimator, for which the local log odds ratios equal 0.
To select the smoothing parameter $\lambda$, one approach minimizes an approximation for the
mean squared error of the estimator.

In a more general modeling context, penalized likelihood applies with standard models
such as logistic regression. For models incorporating standardized versions of explanatory
variables, Lee and Silvapulle (1988) and le Cessie and van Houwelingen (1992) used a
quadratic penalty term of form $\lambda(\beta) = \lambda \sum_j \beta_j^2$. This has the effect of shrinking estimates
toward 0 and reducing prediction error.

Penalized likelihood methods are examples of regularization methods. These are ways of
modifying ML methods to give sensible answers in situations that are unstable in some way,
such as modeling using data sets containing very large numbers of variables. Regularization
methods that penalize by a term that is quadratic in $\beta$, such as $\lambda(\beta) = \lambda \sum_j \beta_j^2$, are
referred to as $L_2$-norm methods. They are analogs of ridge regression for normal-response
models. By contrast, $L_1$-norm regularization uses penalty $\lambda(\beta) = \lambda \sum_j |\beta_j|$. Equivalently,
it maximizes the likelihood subject to the constraint that $\sum_j |\beta_j| \le K$ for some constant
$K$. In ordinary regression, this penalty method is referred to as the lasso ("least absolute
shrinkage and selection operator"). Another possible penalty, using $L_0$-norm, takes $\lambda(\beta)$
to be proportional to the number of nonzero $\beta_j$. This approach has AIC and BIC as
special cases. This sounds ideal, but optimization with this criterion is impractical with
large numbers of variables; for example, the function minimized may not be convex. A
compromise method, SCAD ("smoothly clipped absolute deviation"), starts at the origin
$\beta = 0$ like an $L_1$ penalty and then gradually levels off (Fan and Lv 2010).

As in kernel smoothing, with penalized likelihood the degree of smoothing depends
on the smoothing parameter $\lambda$, the choice of which reflects the bias/variance trade-off.
Increasing $\lambda$ results in greater shrinkage toward 0 in the estimates of $\{\beta_j\}$ and smaller
variance but greater bias. Cross-validation criteria for selecting $\lambda$ are based on fitting the
model to part of the data and then checking goodness of that fit in terms of predictions for
the rest of the data. With $k$-fold cross-validation, this is done $k$ times, each time leaving
out the fraction $1/k$ of the data and predicting it using the model fit for the remaining
data. The selected value of $\lambda$ is the one for which the estimates have the lowest average
prediction error, in some sense. That $\lambda$ value is then used with the method applied to all
the data.
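
The shrinkage produced by a quadratic penalty can be seen in a small numerical experiment. The sketch below fits a logistic regression with one predictor by Newton-Raphson, maximizing the log likelihood minus the smoothing parameter times the squared slope. Leaving the intercept unpenalized is an assumption of this sketch, and the data and the function name `fit_ridge_logistic` are hypothetical.

```python
from math import exp


def fit_ridge_logistic(x, y, lam, iterations=50):
    """Maximize sum_i [y_i * eta_i - log(1 + exp(eta_i))] - lam * beta**2,
    where eta_i = alpha + beta * x_i, by Newton-Raphson from (0, 0)."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        pi = [1.0 / (1.0 + exp(-(alpha + beta * x_i))) for x_i in x]
        w = [p * (1.0 - p) for p in pi]
        # Gradient of the penalized log likelihood.
        g_a = sum(y_i - p for y_i, p in zip(y, pi))
        g_b = sum(x_i * (y_i - p) for x_i, y_i, p in zip(x, y, pi)) - 2.0 * lam * beta
        # Negative Hessian (2 x 2), inverted in closed form.
        h_aa = sum(w)
        h_ab = sum(w_i * x_i for w_i, x_i in zip(w, x))
        h_bb = sum(w_i * x_i ** 2 for w_i, x_i in zip(w, x)) + 2.0 * lam
        det = h_aa * h_bb - h_ab ** 2
        alpha += (h_bb * g_a - h_ab * g_b) / det
        beta += (h_aa * g_b - h_ab * g_a) / det
    return alpha, beta


# Hypothetical data with no separation, so the ordinary ML fit (lam = 0) is finite.
x = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

for lam in (0.0, 1.0, 10.0):
    alpha_hat, beta_hat = fit_ridge_logistic(x, y, lam)
    print(lam, round(alpha_hat, 3), round(beta_hat, 3))
```

The slope estimate moves toward 0 as the smoothing parameter grows but never reaches it, in line with the remark that quadratic penalties shrink effects without removing variables. In practice the smoothing parameter would be chosen by cross-validation as described above, rather than fixed in advance.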

Any particular norm for the penalty function has advantages and disadvantages. Use of a
quadratic penalty is not a strategy for finding a parsimonious model, because all explanatory
variables remain in the model. By contrast, with the lasso ($L_1$ penalty), when $\lambda$ is large
some $\hat{\beta}_j$ shrink to zero. With it, it is informative to plot the estimates as a function of $\lambda$,
to summarize how explanatory variables drop out as $\lambda$ increases. For a factor predictor,
the ordinary lasso solution may select individual indicators rather than entire factors, and
the solution may depend on the coding scheme, so an alternative grouped lasso should be
used. Disadvantages of the lasso approach compared with quadratic penalties are that it
may overly penalize $\beta_j$ that are truly large, and the $\{\hat{\beta}_j\}$ are not asymptotically normal and
can be highly biased, making inference difficult. Some research on the lasso and related
approaches has focused on adjusting the penalty function to make inference possible, to
better determine which predictors truly have effects and to eliminate those that do not, and
to penalize less severely when $|\beta_j|$ is large. For example, the SCAD approach puts little
penalty on an effect that is estimated to be large but also has the effect of equating small
coefficients to 0. An alternative penalty function has both $L_1$ and $L_2$ terms, each with its
own penalty function. For binary data, Notes 7.7 and 7.8 cite several articles that have
proposed and evaluated penalized likelihood methods.

Penalized  likelihood  estimators  have  connections  with  Bayesian  smoothing  methods. 
With prior density function proportional to $\exp[-\lambda(\beta)]$, the posterior density is proportional
to  the  penalized  likelihood  function.  Hence,  the  mode  of  the  posterior  distribution  equals 
the  penalized  likelihood  estimator. 

7.4.6  Why  Shrink  Estimates  Toward  0? 

To methodologists who are used to estimators that are unbiased or approximately so,
methods such as penalized likelihood that shrink $\{\hat{\beta}_j\}$ toward 0 can seem counterintuitive.
Here is some intuition about why shrinkage may be effective. First, consider settings having
a large number of explanatory variables for which most of them may have no effects or very
minor effects, as in many genetics applications such as discussed in Section 7.5. Unless $n$ is
very large, by ordinary sampling variability ML estimates $\{\hat{\beta}_j\}$ will tend to be much larger
in absolute value than the true values. This tendency is exacerbated when we consider only
statistically significant values. Shrinkage such as occurs with penalized likelihood methods
tends to move such estimates closer to the true values.

Second, variable selection methods such as the stepwise procedures discussed in Section
6.1.3 are highly discrete, in the sense that any particular variable either is or is not selected.
Penalized  likelihood  is  more  continuous  in  nature,  with  some  variables  perhaps  receiving 
little  influence  in  the  resulting  prediction  equation  but  not  being  completely  eliminated. 
With  the  lasso  method,  a  variable  could  be  eliminated,  but  in  a  more  objective  way  that  is 
not  dependent  on  which  variables  were  previously  eliminated. 

7.4.7  Firth’s  Penalized  Likelihood  for  Logistic  Regression 

Penalizing a likelihood need not increase bias. One version actually has been
shown to reduce bias of ML estimators (Firth 1993a). For most models the ML estimator $\hat{\beta}$
has bias on the order of $1/n$, and Firth showed how to penalize the log likelihood such that
this reduces to order $1/n^2$. For the canonical parameter of an exponential family model, the
penalized log-likelihood function utilizes the determinant of the information matrix $\mathcal{J}$,

$$L^*(\beta) = L(\beta) + \tfrac{1}{2} \log |\mathcal{J}|.$$

For application to logistic regression, Firth noted that when the model matrix is of full
rank, $\log |\mathcal{J}|$ is strictly concave. Maximizing the penalized likelihood yields a maximum
penalized likelihood estimate that always exists and is unique. This penalized likelihood
then is proportional to the Bayesian posterior distribution resulting from using the Jeffreys
prior. Thus, this penalized ML estimator equals the mode of the posterior distribution
induced by the Jeffreys prior.
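
The simplest case shows how the penalty keeps the estimate finite. Assume, for illustration only, an intercept-only logistic model for y successes in n Bernoulli trials with logit parameter theta. The information is then the scalar n pi (1 - pi), and differentiating the penalized log likelihood gives the score (y - n pi) + (1 - 2 pi)/2, whose root is pi = (y + 1/2)/(n + 1). The sketch below, with hypothetical function names, finds the root by bisection and compares it with that closed form.

```python
from math import exp, log


def firth_score(theta, y, n):
    """Derivative of y*theta - n*log(1 + e^theta) + 0.5*log(n*pi*(1 - pi))."""
    pi = 1.0 / (1.0 + exp(-theta))
    return (y - n * pi) + 0.5 * (1.0 - 2.0 * pi)


def firth_logit_estimate(y, n, lo=-30.0, hi=30.0, tol=1e-12):
    """Root of the penalized score. The score is strictly decreasing in theta,
    so bisection on [lo, hi] finds the unique maximizer."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if firth_score(mid, y, n) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With y = n = 10 the ordinary ML estimate of the logit is infinite,
# but the penalized estimate is finite.
theta_hat = firth_logit_estimate(10, 10)
print(theta_hat, log((10 + 0.5) / (10 - 10 + 0.5)))
```

In this special case the penalized estimate is the logit of the sample proportion after adding 1/2 to the numbers of successes and failures. With explanatory variables no such closed form is available in general, and the penalized likelihood is maximized iteratively.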

7.4.8  Example:  Complete  Separation  but  Finite  Logistic  Estimates 

One  situation  in  which  Firth’s  penalized  likelihood  estimate  is  very  helpful  is  when  complete 
or quasi-complete separation occurs in the space of explanatory variables. Then, ordinary
ML estimates of logistic regression parameters are infinite or do not exist (Section 6.5.1),
but  the  penalized  estimator  is  finite.  Heinze  and  Schemper  (2002)  discussed  this  case. 
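
To make the finiteness concrete, here is a minimal sketch of Firth's method for logistic regression, assuming NumPy is available. It iterates Newton-type steps on the modified score $X^T[y - \pi + h(\tfrac{1}{2} - \pi)]$, where $h$ holds the diagonal of the weighted hat matrix. The function name `firth_logistic` and the six-observation data set are hypothetical (they are not the data of Table 7.2), and the sketch omits safeguards such as step-halving that production software typically includes.

```python
import numpy as np

def firth_logistic(X, y, n_iter=100, tol=1e-8):
    """Firth-penalized logistic regression via Newton-type steps (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        w = pi * (1.0 - pi)
        info = X.T @ (w[:, None] * X)            # information matrix X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        score = X.T @ (y - pi + h * (0.5 - pi))  # Firth's modified score
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical data with complete separation: y = 1 exactly when x > 0,
# so the ordinary ML slope estimate is infinite.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

beta_firth = firth_logistic(X, y)
print("Firth estimates (intercept, slope):", beta_firth)
```

For these symmetric data the penalized intercept estimate is 0 and the slope estimate is finite and positive, whereas ordinary Newton-Raphson iterations for the unpenalized likelihood would drift toward infinity.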

We  illustrate  with  the  data  from  Table  7.2  on  risk  factors  for  the  histological  grade  of 
endometrial  cancer,  analyzed  with  Bayesian  methods  in  Section  7.2.2.  For  the  model 

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,$$

there is quasi-complete separation, and the ML estimate $\hat{\beta}_1 = \infty$. With standardized $x_2$
and $x_3$, the other ML estimated effects are $\hat{\beta}_2 = -0.42$ (SE = 0.44) and $\hat{\beta}_3 = -1.92$
(SE = 0.56).

For comparison, the Firth penalized likelihood estimates are $\hat{\beta}_1 = 2.93$ (SE = 1.55),
$\hat{\beta}_2 = -0.35$ (SE = 0.40), and $\hat{\beta}_3 = -1.72$ (SE = 0.51). The 95% profile penalized likelihood
confidence interval for $\beta_1$ is (0.61, 7.85), which shrinks the ordinary profile likelihood
interval of (1.28, $\infty$) considerably toward 0. Results for the other two estimates do not
change  as  much. 

The  penalized  likelihood  estimates  are  posterior  modes  for  the  Bayesian  approach  using 
the  Jeffreys  prior.  Compared  with  the  posterior  means  for  the  normal  priors  reported  in 
Table 7.3, they fall between the results for normal priors with $\sigma = 1$ and with $\sigma = 10$. In
this case, independent normal priors having $\sigma = 2$ provide a similar posterior mean for $\beta_1$
as  Firth’s  penalized  estimate. 

7.4.9  Generalized  Additive  Models 

The  GLM  generalizes  the  ordinary  linear  model  to  permit  nonnormal  distributions  and 
modeling  functions  of  the  mean.  The  quasi-likelihood  approach  (Section  4.7)  generalizes 
GLMs,  specifying  how  the  variance  depends  on  the  mean  without  assuming  a  particular 
distribution.  Another  generalization  of  the  GLM  replaces  the  linear  predictor  by  additive 
smooth functions of the predictors. The GLM structure $g(\mu_i) = \sum_j \beta_j x_{ij}$ then generalizes
to

$$g(\mu_i) = \sum_j s_j(x_{ij}),$$

where $s_j(\cdot)$ is an unspecified smooth function of predictor $j$. A useful smooth function is the
cubic  spline.  It  has  separate  cubic  polynomials  over  sets  of  disjoint  intervals,  joined  together 
smoothly  at  boundaries  of  those  intervals.  The  boundary  points,  called  knots,  could  be  at 
evenly  spaced  points  for  each  predictor  or  selected  according  to  some  criterion  involving 
both  smoothness  and  closeness  of  the  spline  to  the  data. 

Like  GLMs,  this  model  specifies  a  link  function  g  and  a  distribution  for  the  random 
component.  The  resulting  model  is  called  a  generalized  additive  model,  symbolized  by 
GAM (Hastie and Tibshirani 1990). The GLM is the special case in which each $s_j$ is a
linear function. Also possible is a mixture of explanatory terms of various types, with some
$s_j$ as smooth functions, others as linear functions as in GLMs, and others as indicator
variables  to  include  qualitative  factors. 

The  details  for  fitting  GAMs  are  beyond  our  scope.  The  backfitting  algorithm  employs  a 
generalization  of  the  Newton-Raphson  method  that  utilizes  local  smoothing.  The  algorithm 
initializes $\{\hat{s}_j\}$ identically at 0. Then, at a particular iteration, it updates the estimate $\hat{s}_j$ by a
smoothing of the $\{y_i - \sum_{k \neq j} \hat{s}_k(x_{ik})\}$ that uses the other estimated smooth functions at that
iteration, in turn for $j = 1, \ldots, p$. The fitting procedure corresponds to subtracting from
the  log-likelihood  function  a  penalty  function  that  increases  as  the  smooth  function  gets 
more  wiggly. 
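
The backfitting idea can be sketched for the simplest case, a normal response with identity link, in which each update smooths the partial residuals against one predictor while holding the other estimated functions fixed. This is a simplification of what GAM software does for other links, and it swaps in a Gaussian-kernel (Nadaraya-Watson) smoother in place of a spline. The function names, the bandwidth, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smoother: Gaussian-weighted mean of r at each observed x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, bandwidth=0.3, n_iter=20):
    """Backfitting for an additive model with identity link (illustrative sketch)."""
    n, p = X.shape
    alpha = y.mean()
    s = np.zeros((n, p))              # s[:, j] holds the estimate of s_j at x_ij
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the intercept and all other smooth terms
            partial = y - alpha - (s.sum(axis=1) - s[:, j])
            s[:, j] = kernel_smooth(X[:, j], partial, bandwidth)
            s[:, j] -= s[:, j].mean() # center each term for identifiability
    return alpha, s

# Simulated (hypothetical) data with two nonlinear effects
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, size=n)

alpha_hat, s_hat = backfit(X, y)
fitted = alpha_hat + s_hat.sum(axis=1)
rss_fit = np.sum((y - fitted) ** 2)
rss_null = np.sum((y - y.mean()) ** 2)
print("residual SS, additive fit vs. intercept only:", rss_fit, rss_null)
```

A smaller bandwidth gives a more wiggly fit and a larger one a smoother fit, which plays the role of the smoothing parameter discussed in the text.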

The model fit assigns a deviance contribution and an approximate df value to each $s_j$ in
the  additive  predictor,  enabling  inference  about  those  terms.  For  instance,  a  smooth  function 
having  df  =  4  is  similar  in  overall  complexity  to  a  third-degree  polynomial,  which  has  four 
parameters.  Choosing  a  df  value  or  a  value  for  a  smoothing  parameter  determines  how 
smooth  the  resulting  GAM  fit  looks.  As  in  GLMs,  we  can  compare  deviances  for  nested 
models  to  test  whether  a  model  gives  a  significantly  better  fit  than  a  simpler  model.  A 
disadvantage  compared  with  GLMs  is  the  loss  of  interpretability  for  describing  effects  of 
an  explanatory  variable  that  has  a  smooth  term  in  the  model. 

It  is  usually  sensible  to  try  various  degrees  of  smoothing  to  find  one  that  smooths  the 
data  sufficiently  so  that  the  trend  is  not  too  irregular  but  does  not  smooth  so  much  that  it 
suppresses  interesting  patterns.  The  smoothing  may  suggest  that  a  linear  model  is  adequate 
with  a  particular  link  function  or  it  may  suggest  ways  to  improve  on  linearity.  Some  software 
packages  that  do  not  have  GAMs  can  smooth  the  data  by  employing  a  type  of  regression  that 
gives  greater  weight  to  nearby  observations  in  predicting  the  value  at  a  given  point;  such 
locally  weighted  least-squares  regression  is  often  referred  to  as  lowess.  We  prefer  GAMs 
because  they  recognize  explicitly  the  form  of  the  response  variable.  For  instance,  with  a 
binary  response,  lowess  can  give  predicted  values  below  0  or  above  1  at  some  predictor 
settings.  This  cannot  happen  with  a  GAM  that  assumes  a  binomial  random  component. 

Even  if  you  plan  to  use  GLMs,  a  GAM  is  helpful  for  exploratory  analysis.  For  instance, 
for  continuous  x  with  continuous  responses,  scatter  diagrams  provide  visual  information 
about the dependence of y on x. For binary responses, such diagrams are not very informative.
Plotting the fitted smooth function for a predictor may reveal a general trend without
assuming  a  particular  functional  relationship. 

7.4.10  Example:  GAMs  for  Horseshoe  Crab  Mating  Data 

For the horseshoe crab data introduced in Section 4.3.2, Figure 4.4 showed the trend relating
a  female  crab’s  number  of  male  satellites  to  the  width  of  her  carapace  shell.  This  smooth 
curve  is  the  fit  of  a  generalized  additive  model,  assuming  a  Poisson  distribution  and  using 
the  log  link. 

In Section 5.1.3 we used logistic regression to model the probability that a female crab
has at least one male satellite. For crab $i$, $y_i = 1$ if she has at least one satellite and $y_i = 0$
otherwise.  Figure  5.2  plotted  these  data  against  the  crab’s  carapace  width.  That  figure  also 
showed  a  curve  based  on  smoothing  the  data  using  a  GAM,  assuming  a  binomial  response 
and  logit  link.  This  curve  shows  a  roughly  increasing  trend  and  is  more  informative  than 
viewing  the  binary  data  alone. 

7.4.11  Advantages/Disadvantages  of  Various  Smoothing  Methods 

Compared  with  simple  kernel  smoothing,  penalized  likelihood  methods  and  the  GAM  have 
the  advantage  that  the  ordinary  GLM  is  a  special  case.  They  also  have  a  direct  inferential 
aspect,  as  they  mimic  GLMs  in  assuming  a  binomial  distribution  for  a  binary  response  and 
having  a  df  value  associated  with  each  explanatory  effect. 

Compared  with  ordinary  frequentist  inference,  all  these  methods  have  the  extra  aspect 
of choosing the degree of smoothing. With the Bayesian approach, this is handled with
the choice of prior distribution. An advantage of the Bayesian approach is that its entire
formulation  has  a  stronger  theoretical  basis,  whereas  the  other  smoothing  methods  have 
somewhat  of  an  ad  hoc  nature  in  their  adaptive  choice  of  a  smoothing  parameter.  But,  of 
course,  some  methodologists  are  uncomfortable  with  the  Bayesian  paradigm  because  of 
the  need  to  impose  the  prior;  they  may  instead  prefer  such  frequentist  methods  or  empirical 
Bayes  methods  that  let  the  data  determine  the  degree  of  smoothing. 


7.5  ISSUES  IN  ANALYZING  HIGH-DIMENSIONAL  CATEGORICAL  DATA 

In  an  increasing  variety  of  recent  applications,  data  sets  differ  from  traditional  ones  in 
having  a  very  large  number  p  of  variables,  sometimes  even  more  than  the  number  n  of 
observations.  In  genomics,  such  applications  include  classifying  tumors  by  using  microarray 
gene  expression  or  proteomics  data  or  associating  protein  concentrations  with  expression 
of  genes  or  predicting  a  clinical  prognosis  by  using  gene  expression  data.  Variable  selection 
is  especially  important  in  such  applications,  as  typically  most  effects  are  expected  to  be 
null.  Other  applications  having  large  p  include  biomedical  imaging,  functional  magnetic 
resonance  imaging,  tomography,  signal  processing,  image  analysis,  market  basket  data,  and 
portfolio  allocation  in  finance. 

Traditional  modeling  can  deal  effectively  with  the  sorts  of  examples  shown  in  this  book, 
such  as  modeling  a  response  for  a  few  drugs  and  several  centers  in  a  clinical  trial,  but  it 
can  be  overwhelmed  when  it  needs  to  address  differential  expression  (i.e.,  change  between 
two  or  more  conditions)  in  many  thousands  of  genes  or  brain  activity  in  many  thousands  of 
locations.  Special  software  packages4  are  needed  to  organize  and  analyze  the  data.  Methods 
presented  in  this  chapter,  such  as  regularization  methods  employing  penalized  likelihood, 
are  increasingly  used  with  high-dimensional  data. 

In  this  section  we’ll  discuss  the  analysis  of  high-dimensional  categorical  data.  It  is 
impossible  to  do  justice  to  the  exploding  literature  in  this  area  in  a  single  section.  We’ll 
focus  on  a  particular  difficult  issue  that  arises  in  many  such  studies — selecting  explanatory 
variables  for  an  analysis  out  of  a  very  large  set,  and  making  adjustments  for  multiplicity. 
We’ll  then  describe  some  applications  in  which  novel  approaches  have  been  proposed  for 
high-dimensional  analyses. 


7.5.1  Issues  in  Selecting  Explanatory  Variables 

In  modeling  with  a  very  large  number  of  explanatory  variables,  reducing  their  number  can 
ease  interpretability  and  decrease  prediction  errors  by  removing  variables  that  have  little  if 
any  relevance.  For  example,  in  disease  classification,  of  a  large  number  of  genes  relatively 
few  may  be  responsible  for  the  disease.  Most  effects  are  null  or  essentially  null.  This  is 
reflected by histograms of P-values for testing those effects, which often have an appearance
quite similar to a uniform density function. In addition, with large p, ordinary ML fitting
may  not  even  be  possible.  For  a  binary  response,  complete  separation  often  occurs  once  the 
number  of  predictors  exceeds  a  particular  point.  Even  when  finite  estimates  exist,  they  may 
be very imprecise because of ill-conditioning of the covariance matrix. Moreover, choosing
a model that contains a large number of predictors runs the risk of overfitting the data.
Then, future predictions will tend to be poorer than those obtained with a simpler model.

4 For example, plink for whole genome association analysis, described at pngu.mgh.harvard.edu/~purcell/plink.

In  regression  modeling,  variable  selection  algorithms  such  as  forward  selection  and 
backward elimination are popular. However, such methods have potential pitfalls, especially
when p is large. In particular, for the set of predictors having no true effect, the
maximum  sample  correlation  with  the  response  can  be  quite  large.  Also,  there  can  be 
spurious  collinearity  among  the  predictors  or  spurious  correlation  between  an  important 
predictor  and  a  set  of  unimportant  predictors,  because  of  the  dimensionality.5  Other  criteria 
exist for identifying an optimal subset of explanatory variables, such as minimizing prediction
error or (with AIC) minimizing divergence of the fitted model from reality. With large
p,  though,  it  is  not  feasible  to  check  all  possible  subsets  of  predictors.  Recently,  various 
methods  have  been  proposed  to  deal  with  the  subset  selection  issue  for  large  p.  Roughly, 
the  methods  fall  into  two  types. 
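
The pitfall of large spurious correlations is easy to reproduce by simulation. In the following sketch, assuming NumPy, every predictor is generated independently of the response, so all true correlations are 0. The sample size, number of predictors, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5000
y = rng.standard_normal(n)
X = rng.standard_normal((n, p))      # every predictor is independent of y

def abs_correlations(X, y):
    """Absolute sample correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

r = abs_correlations(X, y)
max_small = r[:10].max()             # largest |r| among the first 10 null predictors
max_large = r.max()                  # largest |r| among all 5000 null predictors
print("max |r| with p = 10:", max_small)
print("max |r| with p = 5000:", max_large)
```

With only 50 observations, the largest of 5000 null correlations is typically far from 0, so a forward selection step that picks the most correlated predictor can easily select pure noise.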

One  approach  uses  alternatives  to  ML  estimation  such  as  various  penalized  likelihood 
methods mentioned in Section 7.4.5. These include regularization using the $L_q$-norm for some
$q$ between 0 and 2 and compromise norms. Zhu and Hastie (2004) applied this to logistic
regression with $q = 2$ for microarray cancer diagnosis. Besides providing shrinkage of
parameter estimates, some of those methods ($L_q$-norm with $0 < q < 1$) can also help with
variable selection. With the lasso ($q = 1$), many of the explanatory variables receive zero
weight  in  the  prediction  equation.  The  number  of  such  variables  included  depends  on  the 
smoothing  parameter.  However,  the  lasso  has  a  tendency  to  include  many  false  positive 
variables  when  p  is  large  (Fan  and  Lv  2010)  and  to  exclude  important  suppressor  variables 
(Magidson  2010).  Note  7.8  cites  several  articles  that  have  investigated  penalized  likelihood 
methods  for  variable  selection. 
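
Why the lasso can set estimates exactly to zero while a quadratic penalty cannot is clearest in a simplified setting that we assume here only for illustration: an ordinary linear model with orthonormal predictors, rather than logistic regression. There, with penalties $\lambda \sum_j |\beta_j|$ and $(\lambda/2) \sum_j \beta_j^2$ added to half the residual sum of squares, the penalized estimates are simple functions of the least squares estimates $b_j$. The function names and the values of $b_j$ and $\lambda$ are hypothetical.

```python
import numpy as np

def ridge_shrink(b, lam):
    """Quadratic (q = 2) penalty: proportional shrinkage, never exactly zero."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Lasso (q = 1) penalty: soft thresholding, zero whenever |b| <= lam."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([3.0, -0.5, 1.2, 0.1])   # hypothetical least squares estimates
lam = 1.0                             # hypothetical smoothing parameter

ridge = ridge_shrink(b, lam)
lasso = lasso_shrink(b, lam)
print("ridge:", ridge)
print("lasso:", lasso)
```

Here the two small effects are eliminated by the lasso but merely halved by the quadratic penalty, and increasing `lam` eliminates more variables.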

A  second  approach  attempts  to  identify  the  relevant  effects  using  standard  significance 
tests but with some adjustment for multiplicity. This can reduce dramatically the dimensionality
of the data by eliminating the large number of predictors for which there is not
strong  evidence  of  an  effect.  This  approach  is  especially  useful  in  applications  in  which 
a  small  portion  of  the  effects  considered  truly  exist.  We  next  discuss  such  multiplicity 
adjustments, including one (the false discovery rate) that has received substantial attention
in  recent  years  for  large  p  applications. 


7.5.2  Adjusting  for  Multiplicity:  The  Bonferroni  Method 

Table 7.6 summarizes results of $g = n_{11} + n_{12} + n_{21} + n_{22}$ significance tests, of which
there are $n_{12}$ incorrect rejections of $H_0$ (i.e., type I errors) and $n_{21}$ incorrect nonrejections
of $H_0$ (type II errors). Testing each hypothesis at level $\alpha^*$ ensures that $E(n_{12}/g) \le \alpha^*$. In
practice, we observe the numbers $n_{+1} = (n_{11} + n_{21})$ of nonrejections and $n_{+2} = (n_{12} + n_{22})$
of rejections, but the actual cell counts are unknown.

A  substantial  literature  exists  about  ways  of  controlling  error  rates  when  conducting  a 
large  number  of  statistical  inferences.  In  a  testing  format,  in  making  multiple  comparisons 
of  groups  on  some  response  variable,  the  “familywise”  error  rate  approach  controls 
$P(n_{12} > 0)$, the probability of making at least one type I error. In a confidence interval
format,  this  corresponds  to  having  the  confidence  coefficient  apply  to  the  entire  set  of 
intervals  formed  rather  than  to  each  individual  one. 


5 Figure 1 in Fan and Lv (2010) illustrates these issues.

Table 7.6 Contingency Table Summarizing Multiple Significance Tests. Type I and Type II Errors Have Frequencies $n_{12}$ and $n_{21}$.

                                    Decision
Condition of $H_0$     Do Not Reject $H_0$     Reject $H_0$
$H_0$ true             $n_{11}$                $n_{12}$
$H_0$ false            $n_{21}$                $n_{22}$

As  Section  3.1.8  explained,  a  simple  multipurpose  way  to  establish  control  over  a 
family  of  inferences  is  the  Bonferroni  method.  This  method  ensures  a  familywise  error 
bound of $\alpha$ on $P(n_{12} > 0)$ by setting $\alpha^* = \alpha/g$ for each inference. However, the method is
conservative,  having  actual  error  rate  bounded  above  by  the  nominal  level  a. 
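
The bound follows from Boole's inequality. In the sketch below, $T$ denotes the set of true null hypotheses, which has $n_{1+} = n_{11} + n_{12}$ members in the notation of Table 7.6, and each test is conducted at level $\alpha/g$; the symbols $T$ and $H_{0k}$ are our notation for this illustration.

```latex
P(n_{12} > 0)
  = P\Bigl( \bigcup_{k \in T} \{\text{reject } H_{0k}\} \Bigr)
  \le \sum_{k \in T} P(\text{reject } H_{0k})
  \le n_{1+} \, \frac{\alpha}{g}
  \le \alpha .
```

Both inequalities contribute to the conservatism: the union bound ignores overlap among the rejection events, and $n_{1+}$ may be smaller than $g$.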

When  g  is  enormous,  such  as  in  detecting  differential  expression  in  thousands  of  genes, 
the Bonferroni approach is too conservative because $\alpha/g$ is so tiny. This makes it difficult
to  establish  significance  in  any  one  test  and  to  discover  any  effects  that  truly  are  there.  But, 
in  the  absence  of  such  an  adjustment,  there  is  the  danger  that  most  significant  results  found 
will  be  type  I  errors,  because  of  the  relatively  small  number  of  true  effects.  Dudoit  et  al. 
(2003)  described  many  alternatives  to  the  Bonferroni  method,  perhaps  the  most  popular  of 
which  we  present  next. 


7.5.3  Adjusting  for  Multiplicity:  The  False  Discovery  Rate 

In Table 7.6, consider the ratio $n_{12}/n_{+2}$, which is the proportion of the rejected null hypotheses
that are erroneously rejected. Then, FDR $= E(n_{12}/n_{+2})$, where we set $n_{12}/n_{+2} = 0$
when $n_{+2} = 0$, is called the false discovery rate (Benjamini and Hochberg 1995).

Suppose all null hypotheses are true. If $n_{12} = 0$, then $n_{12}/n_{+2} = 0$, whereas if $n_{12} > 0$,
then $n_{12}/n_{+2} = 1$, so that FDR $= P(n_{12} > 0)$, the same as the familywise error rate. When
some null hypotheses are false, then FDR is less than the familywise error rate. So, if a
procedure controls the FDR only, it can be less stringent and therefore less conservative. It
is then more powerful, tending to yield more rejections of false $H_0$'s, more so as $g$ increases
and  as  the  number  of  false  hypotheses  increases. 

There  are  several  variations  on  FDR  and  many  types  of  FDR  algorithms.  For  FDR  as  just 
defined, Benjamini and Hochberg (1995) suggested a simple algorithm for ensuring FDR
$\le \alpha$ for a desired $\alpha$. It applies with $g$ independent tests. Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(g)}$ denote
the ordered P-values. Then, we reject the corresponding hypotheses $(1), \ldots, (j^*)$, where $j^*$
is the maximum $j$ for which $P_{(j)} \le j\alpha/g$. The most significant test compares $P_{(1)}$ to $\alpha/g$
and has the same decision as in the ordinary Bonferroni method, but then the other tests
have less conservative requirements. The actual FDR for this method is bounded above by
$(n_{1+}/g)\alpha$, which is $\alpha$ when the null is always true.

Benjamini  and  Hochberg  illustrated  their  method  for  a  study  about  myocardial  infarction. 
For the 15 hypotheses tested, the ordered P-values were


0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.028, 0.030,
0.034, 0.046, 0.32, 0.43, 0.57, 0.65, 0.76, 1.00.

With $\alpha = 0.05$, these are compared with $j(0.05)/15$, starting with $j = 15$. The maximum
$j$ for which $P_{(j)} \le j(0.0033)$ is $j = 4$, for which $P_{(4)} = 0.0095 < 4(0.0033)$. So, the
hypotheses with the four smallest P-values were rejected. By contrast, the Bonferroni approach
with familywise error rate 0.05 compares each P-value to 0.05/15 = 0.0033 and rejects
only three of these hypotheses.
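
The comparison is simple to reproduce. The following sketch, assuming NumPy, applies the step-up rule (reject the hypotheses with the $j^*$ smallest P-values, where $j^*$ is the largest $j$ with $P_{(j)} \le j\alpha/g$) and the Bonferroni rule to the 15 P-values listed above. The function names are ours.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean rejection indicators from the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    g = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, g + 1) / g     # j * alpha / g for j = 1..g
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(g, dtype=bool)
    if below.size > 0:
        j_star = below[-1]                           # zero-based position of j*
        reject[order[: j_star + 1]] = True
    return reject

def bonferroni(pvalues, alpha=0.05):
    """Boolean rejection indicators comparing each P-value with alpha / g."""
    p = np.asarray(pvalues, dtype=float)
    return p <= alpha / p.size

pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.028, 0.030,
         0.034, 0.046, 0.32, 0.43, 0.57, 0.65, 0.76, 1.00]

n_bh = int(benjamini_hochberg(pvals).sum())
n_bonf = int(bonferroni(pvals).sum())
print("FDR rejections:", n_bh)          # 4
print("Bonferroni rejections:", n_bonf) # 3
```

Note that the step-up rule rejects every hypothesis ranked at or below $j^*$, even if some intermediate $P_{(j)}$ exceeds its own threshold.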

Benjamini  and  Yekutieli  (2001)  showed  that  this  method  works  even  when  the  tests  are 
positively  dependent  in  a  certain  sense.  They  suggested  an  adjusted  method  for  general 
dependence  structure,  but  it  is  more  conservative.  An  alternative  approach  fixes  a  threshold 
for each test statistic value or P-value and then estimates the FDR for the set of tests
(Lin et al. 2010). The FDR method also applies when the tests are discrete, under which
the null distributions of P-values are not uniform but instead are stochastically greater
than  uniform.  For  discrete  data,  Gilbert  (2005)  improved  the  method  by  combining  it 
with  an  adjustment  for  discreteness  that  Tarone  (1990)  had  suggested  for  the  Bonferroni 
method. 
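
The adjustment for general dependence can be sketched in a few lines. As we understand the Benjamini and Yekutieli (2001) proposal, it applies the same step-up rule but with $\alpha$ divided by $c(g) = \sum_{i=1}^{g} 1/i$. The function name is ours, and the P-values are the 15 from the myocardial infarction illustration, reused here only for comparison.

```python
import numpy as np

def benjamini_yekutieli(pvalues, alpha=0.05):
    """Step-up rule with thresholds j * alpha / (g * c(g)), c(g) = sum of 1/i."""
    p = np.asarray(pvalues, dtype=float)
    g = p.size
    c_g = np.sum(1.0 / np.arange(1, g + 1))
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, g + 1) / (g * c_g)
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(g, dtype=bool)
    if below.size > 0:
        reject[order[: below[-1] + 1]] = True
    return reject

pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.028, 0.030,
         0.034, 0.046, 0.32, 0.43, 0.57, 0.65, 0.76, 1.00]

n_by = int(benjamini_yekutieli(pvals).sum())
print("Rejections under general dependence:", n_by)  # 3
```

For these data the adjusted method rejects three hypotheses rather than four, which shows the price in conservatism paid for dropping the dependence assumption.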

Because  of  its  lessened  conservatism  and  improved  power  compared  with  familywise 
methods  such  as  Bonferroni,  controlling  FDR  is  a  sensible  strategy  to  employ  in  exploratory 
research  involving  large-scale  testing.  There  is  still  then  a  place  for  traditional  familywise 
multiple  comparison  methods  in  follow-up  validation  studies  involving  the  smaller  numbers 
of  effects  found  to  be  significant  in  the  exploratory  studies.  Dudoit  et  al.  (2003)  surveyed 
these  issues,  in  the  context  of  microarray  experiments. 


7.5.4  Other  Variable  Selection  Methods  with  High-Dimensional  Data 

Fan  and  Lv  (2010)  surveyed  many  ways  of  dealing  with  large  p  by  reducing  the  number 
of  explanatory  variables.  Most  of  them  incorporate  at  least  one  of  the  subset  selection 
approaches discussed above: stepwise algorithms, regularization such as penalized likelihood
methods, and adjustments for multiplicity.

One  approach  conducts  a  large-scale  screening  to  eliminate  unimportant  variables  and 
then a moderate-scale screening to select from them the important variables. For a quantitative
predictor, the large-scale screening could use a two-sample t test to compare the mean
responses  for  the  two  groups  that  are  the  categories  for  y.  An  alternative  method  uses  a 
stepwise  algorithm,  such  as  forward  selection.  To  reduce  the  problems  mentioned  in  Section 
7.5.1  for  applying  forward  selection,  Park  and  Hastie  (2008)  recommended  implementing 
it  with  a  penalized  likelihood  function  using  a  quadratic  penalty  term. 
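
A large-scale screening step of this kind might look as follows. The sketch, assuming NumPy, computes a two-sample t statistic (the unequal-variance form, our choice) for each quantitative predictor, comparing the two groups defined by a binary y, and keeps the k predictors with the largest absolute values. The data are simulated, and the choices of n, p, k, and the three truly relevant predictors are hypothetical.

```python
import numpy as np

def welch_t(X, y):
    """Two-sample t statistic for each column of X, comparing y = 1 with y = 0."""
    X1, X0 = X[y == 1], X[y == 0]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    se = np.sqrt(X1.var(axis=0, ddof=1) / len(X1) + X0.var(axis=0, ddof=1) / len(X0))
    return diff / se

rng = np.random.default_rng(2)
n, p, k = 60, 2000, 20
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
signal = [0, 1, 2]                      # hypothetical truly relevant predictors
X[:, signal] += 3.0 * y[:, None]        # shift their means when y = 1

t = welch_t(X, y)
keep = np.argsort(-np.abs(t))[:k]       # indices of the k largest |t| values
print("retained predictors:", np.sort(keep))
```

The retained set contains the three relevant predictors together with 17 null ones, which a second, moderate-scale stage would then need to sort out.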

An  alternative  variable  reduction  method  replaces  the  set  of  explanatory  variables  by  a 
much  smaller  set  of  artificial  variables  and  then  applies  variable  selection  methods  to  them. 
With  principal  component  regression,  each  artificial  variable  is  a  linear  combination  of 
the  original  variables,  designed  to  explain  as  much  variance  as  possible.  Magidson  (2010) 
proposed  a  related  correlated  component  regression  that  bases  the  first  component  on  an 
average  of  effects  in  single-predictor  models,  the  second  component  on  an  average  of  effects 
in  two-predictor  models  that  use  the  first  component  as  one  of  the  two  predictors  and  one 
of  the  explanatory  variables  as  the  second  one,  and  so  forth.  With  such  methods,  there  is 
no  guarantee  that  the  new  components  will  be  predictive  of  the  response  variable,  and  their 
effects  are  not  as  interpretable,  especially  when  p  is  very  large.  However,  in  screening  out 
many  of  the  explanatory  variables  before  the  components  are  formed,  Magidson  cautioned 
against  screening  out  suppressor  variables  that  may  reveal  their  relevance  only  when  some 
other variable is already in the model.


7.5.5 Examples: High-Dimensional Applications in Genomics

Fan  et  al.  (2010)  described  several  computational  biology  topics  for  which  specialized 
high-dimensional  analyses  have  recently  been  proposed.  These  include  human  genetics  and 
disease  mapping,  discrete  sequence  motif  discovery,  protein  sequence  alignment,  population 
genetics, evolutionary models, and finite mixture clustering for microarray data. The amount
of  molecular  data  is  enormous,  such  as  billions  of  base  pairs  of  DNA  sequence  data  in  the 
GenBank,  and  much  of  the  data  is  of  categorical  form.  A  main  goal  of  many  studies  is 
to  discover  genetic  variations  that  underlie  whether  or  not  a  certain  disease  is  present. 
Methods  for  categorical  data  analysis  can  determine  whether  an  association  exists  between 
a  person’s  genetic  marker  genotype  and  his/her  disease  status. 

To  discover  genetic  markers  that  may  be  associated  with  a  disease,  case-control  structure 
is often used with cases and noncases of the disease (Li and Conti 2009, Umbach and Weinberg
1997). The most abundant variations in the human genome are the single-nucleotide
polymorphisms (SNPs), and an association analysis can study a SNP’s genotype frequency
in a group of diseased patients and a group of controls. For a single SNP, the analysis
might refer to a 2 × 3 table that cross-classifies (case, control) by the SNP possible pairs of
alleles (AA, AB, BB) for an individual’s genotype. But many studies for detecting genetic
signals use hundreds of thousands of SNP markers genotyped for thousands of subjects.
The effects detected are usually quite weak, with relative risks between about 1.1 and 1.5.
Pathway-based  approaches  attempt  to  build  power  by  examining  whether  test  statistics  for 
a  group  of  related  genes  have  consistent  yet  moderate  deviation  from  chance.  Yet,  genetic 
risk  prediction  can  be  challenging  even  when  combining  information  from  various  studies.6 
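
For a single SNP, the analysis of the 2 × 3 table can be sketched with a Pearson chi-squared test of independence (df = 2). The counts below are hypothetical and chosen only for illustration. The code assumes NumPy and uses the fact that the chi-squared right-tail probability with df = 2 equals exp(-x/2), so no statistical library is needed; that shortcut does not hold for other df values.

```python
import numpy as np

# Hypothetical genotype counts:   AA  AB  BB
table = np.array([[30, 50, 20],   # cases
                  [50, 40, 10]])  # controls

row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row @ col / table.sum()          # fitted counts under independence

x2 = ((table - expected) ** 2 / expected).sum()
df = (table.shape[0] - 1) * (table.shape[1] - 1)
p_value = np.exp(-x2 / 2)                   # right-tail probability, valid for df = 2 only

print("X^2 =", round(x2, 3), "df =", df, "P-value =", round(p_value, 4))
```

Here the statistic is about 9.44 with a P-value of about 0.009, which would count as evidence of association for one prespecified SNP but falls far short of the stringent thresholds used when hundreds of thousands of SNPs are tested.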

A  further  complication  in  attempting  to  find  SNP  markers  whose  genotype  frequencies 
are  significantly  different  between  cases  and  controls  is  that  interactions  among  them  may 
affect  the  disease  risk  (Zhang  and  Liu  2007,  Zhang  et  al.  2011).  Allowing  interactions  is 
also  crucial  in  the  building  of  models  for  risk  prediction.  Some  markers  may  have  negligible 
effects  by  themselves  but  considerable  effect  in  combination  with  other  markers.  In  such 
applications,  the  number  of  genotyped  markers  can  be  much  larger  than  the  number  of 
subjects,  and  the  potential  number  of  possible  interaction  combinations  can  be  astronomical 
while  there  may  be  one  or  relatively  few  of  them  that  are  associated  with  the  disease. 

Dudoit  et  al.  (2003)  noted  that  the  biological  question  of  differential  expression  is  a 
multiple hypothesis testing problem: simultaneously testing, for each of possibly thousands
of genes, the null hypothesis of no association between the expression levels and the
responses  or  covariates.  In  genetic  association  studies,  the  null  hypothesis  is  true  or  close  to 
being  true  in  a  vast  majority  of  cases.  So,  there  is  the  multiplicity  danger  that  most  significant 
results  found  will  be  type  I  errors.  Because  of  this,  often  extremely  stringent  sizes  are  used 
for significance, such as $5 \times 10^{-8}$ instead of the usual 0.05. Multiplicity adjustments such
as  FDR  are  especially  relevant,  as  is  then  replicating  the  finding  of  significance  in  an 
independent  sample. 

Yet  another  complication  is  that  complex  dependencies  may  exist  among  the  tests 
applied  to  the  many  parameters  for  a  particular  data  set,  which  makes  the  complete  null 
distributions of test statistics and subsequent P-values unclear. Permutation methods can
then be useful. An empirical distribution of P-values can be generated by repeating the
tests with appropriate permutations of the data, such as by randomly permuting case-control
labels, to induce a null distribution of the P-values while preserving the correlation
structure. For any particular P-value, the proportion of the permutations that give a P-value
smaller than the observed one provides an adjusted P-value (Dudoit et al. 2003).

6 See P. Kraft and D. J. Hunter, “Genetic risk prediction — Are we there yet,” N. Engl. J. Med. 360: 1701-1703,
2009; and K. Wang et al., “Analyzing biological pathways in genome-wide association studies,” Nature Rev.
Genetics 11: 843-854, 2010.
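
One common way to carry out such a permutation adjustment is sketched below, assuming NumPy. It is a single-step variant that works with test statistics rather than P-values: for each permutation of the case-control labels it records the largest statistic over all markers, and the adjusted P-value for a marker is the proportion of permutations whose maximum is at least as large as that marker's observed statistic. The function name, the choice of statistic (absolute difference in mean genotype count), the add-one correction, and the simulated genotypes are all assumptions made for illustration.

```python
import numpy as np

def maxT_adjusted_pvalues(G, labels, n_perm=500, seed=0):
    """Single-step permutation-adjusted P-values based on the maximum statistic."""
    rng = np.random.default_rng(seed)

    def stats(lab):
        # Absolute case-control difference in mean genotype count, per marker
        return np.abs(G[lab == 1].mean(axis=0) - G[lab == 0].mean(axis=0))

    observed = stats(labels)
    exceed = np.zeros(G.shape[1])
    for _ in range(n_perm):
        # Permuting labels keeps each subject's genotype row intact, so the
        # correlation among markers is preserved under the induced null
        perm_max = stats(rng.permutation(labels)).max()
        exceed += perm_max >= observed
    return (exceed + 1) / (n_perm + 1)      # count the observed labeling as one permutation

# Simulated (hypothetical) genotypes coded 0/1/2 for 100 controls and 100 cases
rng = np.random.default_rng(3)
n_per_group, m = 100, 50
labels = np.repeat([0, 1], n_per_group)
G = rng.binomial(2, 0.3, size=(2 * n_per_group, m)).astype(float)
G[labels == 1, 0] = rng.binomial(2, 0.6, size=n_per_group)   # marker 0 is associated

adj_p = maxT_adjusted_pvalues(G, labels)
print("adjusted P-value for marker 0:", adj_p[0])
print("smallest adjusted P-value among null markers:", adj_p[1:].min())
```

Because each subject's whole genotype vector moves with the permuted label, no model for the dependence among markers is required.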

Zhang  et  al.  (2011)  argued  that  for  genetic  association  studies,  ordinary  penalized 
likelihood and stepwise selection methods (e.g., that identify the 10% most effective markers
and  then  search  for  interactions  among  them)  are  ineffective.  They  summarized  a  Bayesian 
partitioning  model  based  on  a  multinomial  likelihood  with  Dirichlet  prior.  It  is  designed 
to  partition  SNPs  into  ones  that  are  unrelated  to  the  disease,  ones  that  are  marginally 
associated  with  the  disease,  and  ones  that  are  jointly  associated  (in  an  interaction)  with  the 
disease.  The  output  is  a  posterior  probability  for  each  SNP  of  belonging  to  each  of  these 
three  groups.  Zhang  et  al.  (2011)  gave  an  example  in  which  an  important  interaction  was 
detected between two SNPs for Crohn’s disease, with data containing genotypes at 1182
SNPs  for  about  2000  cases  and  3000  controls.  As  would  be  expected,  posterior  probabilities 
depend  strongly  on  prior  probabilities,  but  the  order  of  posterior  probabilities  for  different 
SNPs  was  little  affected. 

By  contrast,  Park  and  Hastie  (2008)  compared  penalized  likelihood  incorporating  a 
quadratic  penalty  function  to  other  methods  and  found  that  it  performs  well  in  identifying 
relevant  interactions  in  gene-environment  interaction  models.  They  found  that  it  overcomes 
collinearity  among  predictors  and  can  handle  applications  where  the  number  of  predictors 
is  large  relative  to  the  sample  size. 

In  some  applications,  analyses  have  combined  several  methods  we’ve  presented  in  this 
chapter.  In  modeling  binary  prostate  cancer  status,  Liu  et  al.  (2008)  used  a  semiparametric 
logistic  model  with  a  linear  effect  of  age  but  a  nonparametric  function  of  five  genes  within 
the  cell  growth  pathway,  maximizing  a  penalized  binomial  likelihood.  For  the  nonparametric 
function,  rather  than  using  a  smoothing  spline  such  as  in  GAMs,  the  authors  used  a  positive 
definite  kernel  function.  The  nonparametric  approach  for  the  part  of  the  model  involving 
genetic  effects  reflects  the  complex  way  that  genes  may  interact  with  each  other  and  relate 
to  the  response.  The  likelihood  penalty  and  the  kernel  function  both  incorporate  smoothing 
parameters.  In  a  novel  approach,  rather  than  estimate  these  smoothing  parameters  by  cross- 
validation  or  by  trial-and-error  inspection,  Liu  et  al.  (2008)  treated  them  like  variance 
components  in  a  random  effects  model. 


7.5.6  Example:  Motif  Discovery  for  Protein  Sequences 

Here is an example that illustrates the severity of the challenge in many genetics applications.
Liu  et  al.  (1995)  proposed  a  multinomial  mixture  model  for  motif  discovery  for  protein 
sequences,  using  probability  vectors  for  particular  motif  sites  mixed  with  probabilities  that 
apply  when  an  observation  does  not  belong  to  any  motif  site.  The  data  take  the  form  of  very 
long  sequences  of  the  four  nucleotides  which  make  up  the  DNA,  commonly  abbreviated 
by  A,  C,  G,  and  T.  Understanding  many  biological  processes  involves  identifying  relatively 
short  patterns  of  these  embedded  in  long  strings.  It  is  beyond  our  scope  to  explain  the  details, 
but  the  challenging  aspect  of  it  is  shown  by  the  general  setting  in  which  Liu  et  al.  (1995) 
described the problem: Consider L coins, of which L − J all have the same probability
π of heads, and the remaining J have probabilities of heads different from π and different
from  one  another.  In  each  of  K  independent  trials,  the  L  coins  are  flipped  and  laid  out  in  a 
row  such  that  J  special  coins  are  in  a  contiguous  block  in  the  same  order  in  each  trial,  but 
with unknown location that can vary from trial to trial. The challenge is to estimate all the
probabilities  and  identify  the  locations  of  the  special  coins  in  each  trial. 

More  generally,  there  may  be  multiple  possible  outcomes  for  each  coin  (such  as  A,  C, 
G,  and  T),  there  could  be  multiple  blocks  of  the  special  coins  in  each  trial,  the  special  coins 
might  not  occur  in  a  contiguous  block,  and  it  may  be  of  interest  to  test  the  hypothesis  that 
there  are  special  coins  with  different  probabilities  from  the  common  probability  of  the  other 
coins.  Liu  et  al.  (1995)  and  Jensen  et  al.  (2004)  provided  Bayesian  solutions,  modeling  the 
sequence  patterns  as  a  product  multinomial  using  Dirichlet  priors  and  treating  the  motif 
finding  problem  as  a  missing  data  problem  (because  the  motif  locations  are  not  observed). 
The  motif  sequence  frequencies  can  be  well  estimated  with  a  large  number  of  fragments, 
but  the  number  of  possibilities  grows  exponentially  with  the  number  of  fragments  because 
of  the  unobserved  locations. 
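
The coin-flipping description above can be simulated in a few lines, which may help in seeing why the problem is hard. This is only a sketch of the data-generating setup, not the Bayesian method of Liu et al. (1995); the function name `simulate_trials` and all numerical values are assumptions chosen for illustration. The block location returned for each trial would be unobserved in a real analysis.

```python
import random

def simulate_trials(L, J, K, common_p, special_p, seed=0):
    """Simulate K trials of L coin flips with a contiguous block of J special coins."""
    rng = random.Random(seed)
    trials = []
    for _ in range(K):
        start = rng.randrange(L - J + 1)      # block location, unobserved in practice
        probs = [common_p] * L
        probs[start:start + J] = special_p    # special coins keep the same order
        flips = [int(rng.random() < p) for p in probs]
        trials.append((start, flips))
    return trials

L, J, K = 30, 4, 5
trials = simulate_trials(L, J, K, common_p=0.5, special_p=[0.9, 0.1, 0.95, 0.05])
for start, flips in trials:
    print(start, "".join("H" if f else "T" for f in flips))

# Number of joint configurations of block locations across the K trials.
n_configurations = (L - J + 1) ** K
print(n_configurations)
```

Even in this tiny setting there are 27 possible block locations per trial, so the number of joint configurations is 27 raised to the power K, which is the exponential growth referred to in the text.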


7.5.7  Example:  The  Netflix  Prize 

Netflix  is  a  company  that  distributes  movies  to  subscribers  through  the  mail  with  DVDs 
and  by  Internet  streaming.  For  each  movie  viewed,  a  subscriber  can  rate  the  movie  on  a 
five-point  ordinal  scale.  Based  on  its  accumulated  records  of  such  ratings,  Netflix  gives  the 
subscriber  a  predicted  rating  for  any  movie  that  that  subscriber  could  choose  to  watch. 

In  2006  Netflix  announced  a  competition  for  developing  an  algorithm  for  predicting 
movie reviews. The winner would be awarded a $1 million prize if the solution provided at
least  a  10%  improvement  in  predictions,  in  terms  of  a  particular  root  mean  squared  metric, 
over  the  algorithm  then  in  use  by  Netflix.  The  training  set  consisted  of  about  100  million 
evaluations on 18,000 movies made by almost 500,000 Netflix subscribers. The average
number  of  ratings  per  subscriber  was  208,  a  very  small  fraction  of  the  possible  movies  for 
evaluation. 

For  simplicity  here,  we’ll  imagine  a  binary  positive  versus  negative  rating  rather  than 
a five-point rating. One way to portray the data set is then as an 18,000 × 500,000
movie-by-subscriber matrix, in which a binary rating is shown in 100 million of the cells and the
other cells have no data. Another portrayal of the data, but impractical, is as a 3^{18,000}-cell
contingency table that cross-classifies ratings using categories (positive, negative, unrated)
on the 18,000 movies.
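
A quick calculation shows how sparse the matrix is and why the contingency-table portrayal is impractical. The sketch below uses the rounded counts quoted in the text (the variable names are ours); it finds that only about 1.1% of the matrix cells contain a rating, and that the number of cells in the contingency table, 3 raised to the power 18,000, has 8589 decimal digits.

```python
import math

movies, subscribers, ratings = 18_000, 500_000, 100_000_000

# Share of cells in the movie-by-subscriber matrix that contain a rating.
fraction_rated = ratings / (movies * subscribers)

# Number of decimal digits of 3 ** 18000, found through logarithms.
digits_in_cell_count = math.floor(movies * math.log10(3)) + 1

print(f"{fraction_rated:.4f}")
print(digits_in_cell_count)
```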

For  a  given  subscriber  for  whom  you  have  evaluations  on  some  subset  of  the  movies, 
how  do  you  predict  that  subscriber’s  rating  on  some  other  movie?  It’s  unclear  how  to  do 
this  with  standard  modeling  methods.  Logistic  regression  is  not  readily  relevant,  as  no  other 
subscriber  may  have  seen  the  same  movies  as  well  as  the  one  to  be  rated.  Using  existing 
ratings,  we  could  measure  the  similarity  between  subscribers  and/or  the  similarity  between 
movies  and  then  apply  a  nearest  neighbor  smoothing  approach.  With  subscribers  as  the 
units,  we  predict  the  movie  rating  by  averaging  ratings  for  that  movie  by  subscribers  with 
similar  opinions  on  jointly-watched  movies.  With  movies  as  the  units,  we  predict  the  movie 
rating  by  averaging  ratings  by  that  subscriber  for  similar  movies.  Although  simple,  such  an 
approach  depends  on  a  choice  of  similarity  metric,  there  may  be  few  or  no  close  neighbors 
for  some  movies,  and  many  highly  correlated  neighbors  for  a  movie  may  result  in  that  set 
receiving  too  much  weight. 
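
The subscriber-based version of this nearest neighbor idea can be sketched as follows, for binary ratings. The similarity metric used here (the proportion of jointly rated movies on which two subscribers agree), the function names, and the toy ratings are all assumptions for illustration; they are not taken from the Netflix competition.

```python
def agreement(a, b, exclude):
    """Proportion of jointly rated movies (other than `exclude`) with equal ratings."""
    shared = (set(a) & set(b)) - {exclude}
    if not shared:
        return None
    return sum(a[m] == b[m] for m in shared) / len(shared)

def predict_rating(ratings, user, movie, k=2):
    """Average the ratings of `movie` by the k subscribers most similar to `user`."""
    candidates = []
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        sim = agreement(ratings[user], theirs, movie)
        if sim is not None:
            candidates.append((sim, theirs[movie]))
    if not candidates:
        return None  # no neighbor has rated this movie
    top = sorted(candidates, key=lambda t: t[0], reverse=True)[:k]
    return sum(r for _, r in top) / len(top)

# Hypothetical ratings: 1 = positive, 0 = negative; missing keys are unrated movies.
ratings = {
    "ann": {"m1": 1, "m2": 0, "m3": 1},
    "bob": {"m1": 1, "m2": 0, "m3": 1, "m4": 1},
    "cat": {"m1": 0, "m2": 1, "m4": 0},
    "dan": {"m1": 1, "m2": 0, "m4": 1},
}
prediction = predict_rating(ratings, "ann", "m4", k=2)
print(prediction)
```

Here the two subscribers who agree with "ann" on every jointly rated movie both rated "m4" positively, so the predicted rating is 1.0. The `None` return value corresponds to the difficulty noted above that some movies have few or no close neighbors.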

An  alternative  approach,  described  by  the  team  that  won  the  Netflix  competition  (Bell 
et  al.  2010),  used  latent  variables.  Both  subscribers  and  movies  were  summarized  on  a 
latent  vector  of  much  smaller  dimension,  where  the  components  of  the  vector  referred  to 
characteristics  such  as  amount  of  violence,  the  amount  of  drama  versus  comedy,  independent 
versus large-budget Hollywood star-driven movies, and characteristics that might not have
a  ready  interpretation.  A  subscriber’s  rating  was  based  on  the  inner  product  of  the  values 
of  the  subscriber’s  latent  vector  and  the  movie’s  latent  vector.  The  model  was  fitted  with 
L2-norm penalized likelihood methods. Extensions of the method used the fact that the set
of  movies  a  subscriber  chooses  to  rate  is  an  additional  source  of  information  about  that 
person’s  tastes,  and  each  subscriber’s  parameters  could  gradually  vary  over  time. 
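
A minimal sketch of such a latent variable model follows. It is not the winning team's implementation: it assumes squared-error loss (corresponding to a normal likelihood), fits the latent vectors by stochastic gradient descent with an L2 penalty, and uses a handful of made-up ratings. The function names and tuning values (`dim`, `lam`, `lr`, `epochs`) are hypothetical.

```python
import random

def fit_latent_factors(ratings, n_users, n_movies, dim=2, lam=0.01,
                       lr=0.02, epochs=2000, seed=0):
    """Fit subscriber vectors P and movie vectors Q so that P[u] . Q[m] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - sum(a * b for a, b in zip(P[u], Q[m]))
            for k in range(dim):
                pu, qm = P[u][k], Q[m][k]
                P[u][k] += lr * (err * qm - lam * pu)  # L2 penalty shrinks toward 0
                Q[m][k] += lr * (err * pu - lam * qm)
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of inner-product predictions on the given ratings."""
    sq = [(r - sum(a * b for a, b in zip(P[u], Q[m]))) ** 2 for u, m, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical (subscriber, movie, rating) triples on a five-point scale.
observed = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 2, 1),
            (2, 1, 2), (2, 2, 5), (3, 0, 1), (3, 1, 1), (3, 2, 4)]

zero_P = [[0.0, 0.0] for _ in range(4)]
zero_Q = [[0.0, 0.0] for _ in range(3)]
baseline_rmse = rmse(observed, zero_P, zero_Q)   # predicting 0 everywhere

P, Q = fit_latent_factors(observed, n_users=4, n_movies=3)
fitted_rmse = rmse(observed, P, Q)
unseen = sum(a * b for a, b in zip(P[1], Q[1]))  # subscriber 1 has not rated movie 1
print(round(baseline_rmse, 3), round(fitted_rmse, 3), round(unseen, 2))
```

The prediction for an unrated cell is just the inner product of the two fitted vectors, which is what allows the model to fill in the empty cells of the movie-by-subscriber matrix.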

Bell  et  al.  (2010)  explained  that  an  ensemble  method  that  generates  multiple  predictions 
from  a  variety  of  methods  and  then  averages  over  them  tends  to  generate  better  predictions 
than  any  one  method.  For  the  Netflix  prize,  this  ensemble  method  incorporated  both  latent 
variable  and  nearest  neighbor  models. 


7.5.8  Example:  Credit  Scoring 

Credit  scoring  is  the  term  describing  methods  for  classifying  applicants  for  credit  into 
“good”  and  “bad”  risk  classes.  According  to  Hand  and  Henley  (1997),  many  credit  scoring 
databases  have  more  than  100,000  applicants  measured  on  more  than  100  variables.  The 
probability  that  an  applicant  will  default  must  be  estimated  based  on  characteristics  such  as 
annual  income,  occupation,  marital  status,  age,  post  code,  credit  card  possession,  length  of 
time  at  current  address,  type  of  bank  account,  court  judgments,  time  with  employer,  time 
with  bank,  and  details  of  loan  payments. 

Here,  although  p  is  large,  n  is  much  larger,  so  the  challenges  are  not  as  severe  as 
in  cases  where  p  exceeds  n.  Methods  used  include  logistic  regression,  nearest  neighbor 
smoothing,  and  non-model-based  classification  methods  presented  in  Chapter  15,  such  as 
linear  discriminant  analysis.  A  complication  is  lots  of  missing  values,  although  this  itself 
can  be  a  useful  indicator  for  classification.  Inherently  continuous  variables  such  as  income 
are  measured  with  discrete  categories,  and  expert  knowledge  may  impose  monotonicity 
constraints  on  effects  for  different  levels  of  a  factor  (e.g.,  such  that  the  probability  of 
default  decreases  as  a  function  of  income,  adjusting  for  other  variables).  When  the  risk 
classes  are  not  well  separated,  the  response  probability  is  a  rather  flat  function  and  there  is 
the  danger  of  overfitting,  so  penalized  likelihood  and  other  smoothing  mechanisms  can  be 
useful.  The  sample  is  usually  far  from  random,  which  limits  possible  statistical  inference. 

A  goal  in  selecting  predictors  is  to  include  ones  that  discriminate  well  between  default 
and  nondefault  outcomes.  The  success  in  identifying  default  cases  can  be  summarized  with 
standard  tools  such  as  classification  tables  and  ROC  curves,  plotting  the  true  positive  rate 
against  the  false  positive  rate  for  various  probability  thresholds  for  predicting  default. 
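
As an illustration of how an ROC curve is built from fitted probabilities, the following sketch computes the (false positive rate, true positive rate) pair at each threshold and the area under the curve by the trapezoidal rule. The outcomes and fitted probabilities are made up, and the function names are ours.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Predict default whenever the fitted probability is at least t.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data: 1 = default, with fitted default probabilities.
default = [1, 1, 0, 1, 0, 0]
fitted = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(default, fitted)
auc = area_under_curve(points)
print(points)
print(round(auc, 4))
```

For these six applicants the area is 8/9, which equals the proportion of (default, nondefault) pairs in which the defaulting applicant has the higher fitted probability.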

Other business-related applications can have, by contrast, enormous values for p as well
as  n.  Examples  are  market  basket  data  and  website  browsing  behavior,  as  described  at  the 
beginning  of  Section  15.3. 


NOTES 

Section 7.1: Probit and Complementary Log-Log Models

7.1 Probits/log-log: Finney (1971) is a standard reference on probit modeling. Ashford and Sowden
(1970) generalized the probit model for multivariate binary responses; see also Lesaffre and
Molenberghs (1991) and Ochi and Prentice (1984). Wedderburn (1976) showed that the log
likelihood  function  is  concave  for  probit  and  complementary  log-log  links. 

7.2 Utility/extreme-value/logit/probit: For the utility model in Section 7.1.1, suppose ε_y are
independent extreme-value random variables (instead of normal), with cdf F(ε) = exp[−exp(−ε)].
Then, McFadden (1974) showed that P(Y = 1) satisfies the logistic regression model, because
the difference between two extreme-value random variables has the logistic distribution. See
also Amemiya (1981) and Maddala (1983, p. 60). Chambers and Cox (1967) showed that it is
difficult to distinguish between models using probit and logit links unless n is extremely large.
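
The distributional fact behind this result can be checked by simulation. The sketch below (our own illustration, with hypothetical names and an arbitrary seed) draws pairs of independent extreme-value variables by inverting the cdf exp[−exp(−e)], and compares the empirical cdf of their differences with the standard logistic cdf.

```python
import math
import random

rng = random.Random(1)

def extreme_value_draw():
    """Inverse-cdf draw from F(e) = exp[-exp(-e)]."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

n = 100_000
diffs = sorted(extreme_value_draw() - extreme_value_draw() for _ in range(n))

# Largest gap between the empirical cdf of the differences and the logistic cdf.
max_gap = max(
    abs((i + 1) / n - 1.0 / (1.0 + math.exp(-d)))
    for i, d in enumerate(diffs)
)
print(round(max_gap, 4))
```

With this many draws the largest gap should be a few thousandths at most, consistent with the differences following the logistic distribution.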

7.3  Other  link  functions:  Other  link  functions  have  been  proposed  for  binary  data,  including  the 
inverse of the cdf of a t distribution (for which the probit is the limiting case as df → ∞),
a log-gamma link (Genter and Farewell 1985) for which probit, complementary log-log, and
log-log are special cases, and a weighted average of logit, log-log, and complementary log-log
links (Lang 1999). Prentice (1976b) and Stukel (1988) extended the scope of logistic regression
by introducing shape parameters that modify the behavior of the curve in extreme probability
regions and allow for asymmetric treatment of the two tails. Prentice (1975, 1976b) used
the inverse cdf of the logarithm of an F random variable, for which df1 = df2 = 2 gives the
logistic. Guerrero and Johnson (1982) applied the Box-Cox power transformation to the odds,
for which the logit is a special case. For other generalizations, see Aranda-Ordaz (1981), Kateri
and  Agresti  (2010),  and  Pregibon  (1980). 


Section  7.2:  Bayesian  Inference  for  Binary  Regression 

7.4 Bayes literature: Racine et al. (1986) used Bayesian methods to obtain a posterior interval for
LD50 for the probit model. Chaloner and Larntz (1989) used Bayesian methods to determine
optimal experimental design for logistic regression. For other Bayesian work on case-control
studies, see Li and Conti (2009), Mukherjee and Chatterjee (2008), and Muller and Roeder
(1997). For Bayesian item response modeling, Tsutakawa and Lin (1986) specified prior
distributions on response probabilities and used them to induce priors on model parameters, an
approach extended to binary regression models by Bedrick et al. (1997) and Christensen et al.
(2010). Ghosh and Mukherjee (2010) surveyed Bayesian work on item response modeling.
For binary regression, Zellner and Rossi (1984) used Monte Carlo methods with importance
sampling, giving particular attention to multivariate normal priors, Chen et al. (2008) and
Ibrahim and Laud (1991) used the Jeffreys prior, and Wong and Mason (1985) considered
multilevel models. Dey et al. (2000) edited a book on Bayesian analyses for GLMs. The 2010
text Frontiers of Statistical Decision Making and Bayesian Analysis in honor of James Berger
contains a chapter on “Bayesian Categorical Data Analysis” that has separate contributions on
smoothing  (by  J.  Albert),  on  matched-pairs  binary  data  (by  M.  Ghosh  and  B.  Mukherjee),  and 
on  the  choice  of  link  functions  for  binary  data  (by  M.-H.  Chen  and  colleagues).  Agresti  and 
Hitchcock  (2005),  Congdon  (2005),  Leonard  (1999),  and  Leonard  and  Hsu  (1994)  surveyed 
Bayesian  methods  for  categorical  data. 


Section  7.3:  Conditional  Logistic  Regression 

7.5  Conditional  logistic  and  exact:  For  more  details  about  conditional  logistic  regression,  see 
Section 11.2, Breslow (1976), Breslow and Powers (1978), Breslow et al. (1978), Breslow
and Day (1980, Chap. 7), Cox (1970), Farewell (1979), Hosmer and Lemeshow (2000, Chap.
5), Lloyd (1999, Chap. 7), Prentice (1976a), and Prentice and Breslow (1978). Liang (1984)
showed  that  conditional  ML  estimators  and  conditional  score  tests  are  asymptotically  equivalent 
to  their  unconditional  counterparts  under  sampling  from  exponential  families.  For  more  on 
exact  inference  using  conditional  distributions  for  contingency  tables  and  logistic  regression, 
see Sections 16.5 and 16.6, Agresti (1992), Hirji et al. (1987), Mehta and Patel (1995), and the
StatXact  and  LogXact  manuals  (Cytel  Software).  Mehta  et  al.  (2000)  discussed  Monte  Carlo 
approximations.  For  improved  higher-order  asymptotic  methods,  see  Brazzale  and  Davison 
(2008),  Brazzale  et  al.  (2007),  and  Davison  et  al.  (2006). 

Section 7.4: Smoothing: Kernels, Penalized Likelihood, Generalized Additive Models

7.6 Nearest neighbors and other kernels: See Hastie et al. (2009, Chap. 13) and references therein
for details about the nearest neighbor method. Natural application areas are ones for which the
data occur in physical space. See Besag (1974). Smoothing methods for binary data extend to
multinomial responses. See Aitchison and Aitken (1976) and Exercise 7.32 for a simple kernel
smoother.  Hall  and  Titterington  (1987)  studied  rates  of  convergence  for  multinomial  kernel 
estimators  and  defined  one  that  achieves  the  optimal  rate.  Ordinary  kernel  estimators  tend  to  be 
biased  toward  zero  at  the  boundary  of  a  table.  Dong  and  Simonoff  (1994)  dealt  with  improving 
kernel  estimates  on  the  boundary  of  large,  sparse  contingency  tables. 

7.7 Penalized likelihood, GAMs, smoothing surveys: Good and Gaskins (1971) introduced
penalized likelihood methods. For more about the lasso, see Hastie et al. (2009, Sec. 3.4, 3.8).
Simonoff (1983, 1996, 1998) proposed penalized likelihood methods for multinomial data.
Kauermann and Tutz (2001) proposed likelihood-ratio goodness-of-fit tests of GLMs and
GAMs against smooth alternatives. Yee and Wild (1996) defined generalized additive models
for nominal and ordinal responses. See also Hastie and Tibshirani (1990) and Tutz (2011, Sec.
10.3.2). For surveys of smoothing methods, see Fahrmeir and Tutz (2001, Chap. 5), Lloyd
(1999, Chap. 5), Simonoff (1996, Chap. 6; 1998), and Tutz (2011, Chap. 6, 10). See Albert
(2010)  for  Bayesian  smoothing  methods. 


Section  7.5:  Issues  in  Analyzing  High-Dimensional  Categorical  Data 

7.8 Regularization with large p: Fan and Lv (2010) and Tutz (2011) reviewed penalized likelihood
methods  for  variable  selection  in  high  dimensions.  Genkin  et  al.  (2007)  proposed  a  type  of 
Bayesian  lasso  for  logistic  regression.  Meier  et  al.  (2008)  extended  the  lasso  to  do  variable 
selection  on  predefined  groups  of  variables  and  suggested  a  penalty  term  that  is  intermediate 
between  a  lasso  and  a  quadratic  penalty. 

7.9 Multiple testing: For variable selection procedures, Westfall and Wolfinger (1997) and Westfall
and  Young  (1993)  presented  ways  to  adjust  P-values  to  take  multiple  tests  into  account,  the 
first  reference  focusing  on  discrete  distributions.  Dudoit  et  al.  (2003)  and  Farcomeni  (2008) 
surveyed  the  issues  in  large-scale  multiple  hypothesis  testing.  Benjamini  and  Hochberg  (1995) 
noted  that  their  approach  corresponds  to  a  constrained  maximization  problem  that  chooses  a 
level α* for each test that maximizes the number of rejections n_{+2} subject to the constraint
α*g/n_{+2} ≤ α.
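
The step-up rule of Benjamini and Hochberg (1995) can be sketched as follows: with g tests, reject the hypotheses having the r smallest P-values, where r is the largest rank for which the r-th smallest P-value is no greater than r times alpha divided by g. This corresponds to the constrained maximization described in this note, with the per-test level equal to r·alpha/g. The function name and the P-values below are hypothetical, and the sketch assumes the usual conditions (such as independent tests) under which the rule controls the false discovery rate.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of the hypotheses rejected by the step-up rule."""
    g = len(pvalues)
    order = sorted(range(g), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / g:
            n_reject = rank      # keep the largest rank meeting the bound
    return sorted(order[:n_reject])

# Hypothetical P-values from g = 10 tests.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

rejected_bh = benjamini_hochberg(pvalues, alpha=0.05)
rejected_bonferroni = [i for i, p in enumerate(pvalues) if p <= 0.05 / len(pvalues)]
print(rejected_bh)
print(rejected_bonferroni)
```

For these P-values the step-up rule rejects the first two hypotheses, whereas the Bonferroni method, which compares every P-value with 0.05/10, rejects only the first.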


EXERCISES 

Applications 

7.1  Refer  to  Exercise  5.2  on  cancer  remission  with  labeling  index  (LI)  as  predictor. 
Table  7.7  shows  output  for  fitting  a  probit  model.  Interpret  the  parameter  estimates 


Table 7.7 Output for Exercise 7.1 on Probit Model for Cancer Remission

                         Standard   Likelihood Ratio 95%
Parameter    Estimate    Error      Confidence Limits       Chi-Square   Pr > ChiSq
Intercept    -2.3178     0.7795     -4.0114    -0.9084      8.84         0.0029
LI            0.0878     0.0328      0.0275     0.1575      7.19         0.0073


ALTERNATIVE  MODELING  OF  BINARY  RESPONSE  DATA 


(a)  using  characteristics  of  the  normal  cdf  response  curve,  (b)  finding  the  estimated 
rate  of  change  in  the  probability  of  remission  where  it  equals  0.50,  (c)  finding  the 
difference  between  the  estimated  probabilities  of  remission  at  the  upper  and  lower 
quartiles  of  LI,  28  and  14,  and  (d)  describing  the  effect  of  LI  on  an  underlying  latent 
variable  for  remission. 

7.2  For  the  horseshoe  crab  data  (Table  4.3),  fit  a  probit  model  to  describe  the  effects 
of  width  and  color  as  a  factor  on  the  probability  of  a  satellite.  Interpret  effects  and 
conduct  inference. 

7.3  For  the  flour  beetle  mortality  data  in  Table  7.1,  fit  models  using  the  (a)  logit,  (b) 
probit,  (c)  complementary  log-log,  and  (d)  log-log  links,  with  dosage  entered  in  the 
model  in  standardized  form.  Report  and  interpret  model  parameter  estimates.  What 
aspect  of  the  data  pattern  causes  the  model  with  log-log  link  to  fit  so  poorly? 

7.4 For Table 5.3 on maternal alcohol consumption and child’s congenital malformations,
report the posterior mean and standard deviation and a 95% posterior interval
(equal-tail or HPD) for β in the linear logit model with scores (0, 0.5, 1.5, 4.0,
7.0). Use (a) the N(0, 1000²) prior on model parameters and (b) the N(0, 1) prior.
Compare  results  to  those  obtained  with  ML. 

7.5  Refer  to  the  previous  exercise.  Conduct  Bayesian  analyses  with  the  probit  link,  using 
prior  distributions  that  (on  the  probit  scale)  are  comparable  to  the  priors  used  for  the 
logit  link.  Compare  results  to  those  obtained  with  ML  probit  and  with  the  Bayesian 
logistic  analysis. 

7.6  For  the  horseshoe  crab  data  (Table  4.3)  available  at  the  text  website,  conduct  a 
Bayesian  analysis  for  the  logistic  model  with  width  and  dark  color  indicator  as 
predictors  of  the  probability  of  satellites,  using  relatively  uninformative  normal 
priors.  Interpret  results,  and  compare  them  to  the  ML  fit. 

7.7  Refer  to  the  example  on  endometrial  cancer  in  Section  7.2.2.  To  obtain  analyses  for 
the  corresponding  probit  model  that  give  similar  substantive  results,  how  would  you 
need to change σ from the values of 10 and 1 used in the logistic analyses? Conduct
such analyses, and report and interpret the posterior mean and standard deviation
and a 95% equal-tail or HPD posterior interval for β_1.

7.8  In  Exercise  6.20,  the  main  effects  logistic  model  had  all  ML  estimates  infinite. 
By contrast, conduct a Bayesian analysis with independent N(0, σ²) priors. Show
how the posterior mean and standard deviation estimates of the effects compare for
σ = 1, σ = 10, and σ = 100.

7.9  Construct  an  artificial  example  of  binary  data  with  a  single  quantitative  predictor 
and  select  a  normal  prior  such  that  Bayesian  inference  gives  substantively  different 
results  than  frequentist  inference.  Discuss  the  factors  that  cause  the  results  to  be  so 
different. 

7.10 Construct a 2 × 2 table for a binary response and binary predictor x such that analyses
you conduct for a logistic or probit model with linear predictor α + βx have ML
estimate β̂ = ∞ but Bayesian posterior mean and posterior intervals for β that are
finite.

7.11 For the 2 × 2 table with counts (by row) of (3, 1 / 1, 3), conduct a Bayesian analysis
for the model, logit[P(Y = 1)] = α + βx, using N(0, 1) priors for α and β.

a. With x coded as (1, 0), find the posterior mean and standard deviation for β.

b. With x coded as (0.5, −0.5), find the posterior mean and standard deviation for β.
Compare to (a), and explain why results differ somewhat. Would different results
happen for β̂ and its SE with frequentist analyses for these different codings of x?

7.12 For the 2 × 2 × 5 Table 6.16, the small cell counts make large-sample analyses
questionable. Conduct a small-sample test of conditional independence, and
interpret. 

7.13  Table  7.8  comes  from  a  1987  study  of  nonmetastatic  osteosarcoma  (A.  Goorin, 
J. Clin. Oncol. 5: 1178-1184, and LogXact manual). The response is whether the
subject  achieved  a  three-year  disease-free  interval. 

a.  Show  that  each  predictor  has  a  significant  effect  when  used  individually  without 
the  others. 

b.  Try  to  fit  a  main-effects  logistic  regression  model  containing  all  three  predictors. 
Explain  why  the  ML  estimate  for  the  effect  of  lymphocytic  infiltration  is  infinite. 

c.  Using  conditional  logistic  regression,  conduct  an  exact  test  for  the  effect  of 
lymphocytic  infiltration,  controlling  for  the  other  variables.  Interpret  results. 


Table 7.8 Data for Exercise 7.13

Lymphocytic               Osteoblastic    Disease-Free
Infiltration    Gender    Pathology       Yes    No
High            Female    No                3     0
                          Yes               2     0
                Male      No                4     0
                          Yes               1     0
Low             Female    No                5     0
                          Yes               3     2
                Male      No                5     4
                          Yes               6    11

Source:  LogXact  7.  Cambridge,  MA:  CYTEL  Software,  2005,  p.  171. 


7.14  Using  formula  (7.13),  smooth  the  data  in  Exercise  5.22  relating  income  to  having 
a  travel  credit  card.  Graph  results  for  three  values  of  the  smoothing  parameter, 
corresponding  to  ones  that  (in  your  opinion)  smooth  too  much,  too  little,  and  about 
the  right  amount. 

7.15  Use  the  Firth  penalized  likelihood  method  to  obtain  finite  estimates  for  the  data  in 
Figure 6.5. Compare the 95% profile penalized likelihood confidence interval for β
to  the  corresponding  ordinary  profile  likelihood  interval. 



7.16 Using a generalized additive model, construct a figure like Figure 5.2 for the horseshoe
crab data, but using weight instead of width as the predictor.

7.17  Smooth  the  count  data  in  Figure  4.4  using  generalized  additive  models.  Graph 
results  for  three  values  of  the  smoothing  parameter,  corresponding  to  ones  that  (in 
your  opinion)  smooth  too  much,  too  little,  and  about  the  right  amount. 

7.18 The credit-scoring data file at www.statistik.lmu.de/service/datenarchiv/kredit/kredit_e.html
includes 20 covariates for 1000 observations. Build a
model  for  credit-worthiness,  using  as  potential  predictors:  running  account,  duration 
of  credit,  payment  of  previous  credits,  intended  use,  gender,  marital  status. 

7.19  Project:  Go  to  a  site  with  large  data  files,  such  as  the  UCI  Machine  Learning 
Repository (archive.ics.uci.edu/ml). Find a data set of interest to you that
has  a  binary  response  variable.  Use  at  least  one  method  discussed  in  this  chapter 
to  learn  something  about  the  data.  Summarize  your  analyses  in  a  two-page  report, 
attaching  an  appendix  showing  your  use  of  software. 


Theory  and  Methods 

7.20  Refer  to  the  threshold  model  used  in  Section  7.1.1  to  motivate  the  probit  model. 

a. For identifiability, explain why you can set σ = 1 and τ = 0. Explain why β then
represents the expected number of standard deviation change in Y* for a 1-unit
increase  in  x. 

b.  Suppose  you  fitted  this  model  separately  to  each  of  two  groups  and  wanted  to 
compare  the  effects  of  x  for  those  groups.  Suppose  that  the  two  groups  had 
different  residual  variability  for  their  underlying  latent  variable.  Explain  why 
even if the β parameters are identical for the two groups for the latent model, the
corresponding  effect  parameters  are  not  the  same  for  the  probit  (or  logistic)  models 
actually  used.  [Allison  (1999)  discussed  this  issue  and  proposed  an  alternative 
way  of  comparing  coefficients  that  can  adjust  for  unequal  residual  variation.] 

7.21 For independent binary $\{y_i\}$, from scratch (without using any results for GLMs)
show that the likelihood equations for the logistic and probit regression models are

\[
\sum_i (y_i - \pi_i)\, z_i\, x_{ij} = 0, \qquad j = 0, \ldots, p,
\]

where $z_i = 1$ for the logistic case and $z_i = \phi\bigl(\sum_j \beta_j x_{ij}\bigr) / [\pi_i (1 - \pi_i)]$ for the probit.

7.22 For the linear probability model $\pi_i = \alpha + \beta x_i$ applied with independent binary $\{y_i\}$,
show that the likelihood equations are


\[
\sum_i \frac{y_i - \pi_i}{\pi_i (1 - \pi_i)} = 0, \qquad \sum_i \frac{(y_i - \pi_i)\, x_i}{\pi_i (1 - \pi_i)} = 0.
\]



7.23 Derive the estimated asymptotic covariance matrix of β̂ for the probit model from
the GLM expression (4.31). [Hint: Recall that in the binomial case in Section 4.4,
y_i is the proportion of successes.]

7.24  Consider  model  (7.3)  with  complementary  log-log  link. 

a. Find x at which π(x) = 1/2.

b. Show the greatest rate of change of π(x) occurs at x = −α/β. What does π(x)
equal  at  that  point?  Give  the  corresponding  result  for  the  model  with  log-log  link, 
and  compare  to  the  logistic  and  probit  models. 

7.25 For the log-log model (7.4), explain how to interpret β.

7.26  Find  the  likelihood  equations  and  apply  (4.31)  to  find  the  form  of  the  asymptotic 
covariance matrix of β̂ for a binary GLM using the complementary log-log link
function. 

7.27  In  logistic  regression,  suppose  you  use  a  Bayesian  approach  with  an  uninformative 
prior such as N(0, 1000²) for each model parameter. For any particular setting of
the  explanatory  variables,  explain  why  this  implies  that  nearly  all  the  prior  weight 
is  placed  on  probability  values  very  close  to  0  and  very  close  to  1. 

7.28  In  a  binary  regression  model,  one  of  the  explanatory  variables  is  binary.  For  Bayesian 
fitting of the model, what is the reason for using coding such as (0.5, −0.5) or
(1, −1) for levels of the binary predictor, instead of the usual (1, 0) indicator coding?

7.29 For interval estimation of a logistic regression model parameter β_j, explain why the
Bayesian highest posterior density interval is appropriate for β_j but not for the odds
ratio effect exp(β_j). [Hint: See Section 3.6.5.]

7.30 For independent binomial sampling with the model logit(π_ik) = α + βx_i + β_k^Z for
a 2 × 2 × K table, construct the log likelihood and identify the sufficient statistics
to be conditioned out to perform exact inference about β.

7.31 Refer to the kernel smoother (7.12). Show that Σ_i π̃_i = 1 if and only if the column
totals of K equal 1.

7.32 For a multinomial distribution with c unordered categories, Aitchison and Aitken
(1976) proposed a kernel estimator of form (7.12), having

\[
k_{ij} = \begin{cases} \gamma, & i = j, \\ (1 - \gamma)/(c - 1), & i \neq j, \end{cases}
\]

for $1/c \le \gamma \le 1$.

a. Show that the resulting kernel estimator of $\pi$ has form

\[
(1 - \lambda)\, p + \lambda\, (1/c),
\]

where $\lambda = c(1 - \gamma)/(c - 1)$, which shrinks the sample proportions toward
$(1/c, \ldots, 1/c)$.

b. Show that as $\gamma$ decreases from 1 to $1/c$, $\lambda$ increases from 0 to 1. [Brown and Rundell
(1985) proved that when no $\pi_i = 1$, a $\lambda$ value exists such that the total mean
squared error is smaller for this kernel estimator than for the sample proportions.]

c. Show that the kernel estimator of form in (a) is the same as the Bayes estimator
(1.19) for the Dirichlet prior with $\{\alpha_i = \lambda n / [(1 - \lambda) c]\}$. Using this result, suggest
a way of letting the data determine the value of $\lambda$ in the kernel estimator.


7.33 Refer to Copas’s kernel smoother (7.13) for binary regression, with $\phi(u) = \exp(-u^2/2)$.

a. To describe how close this estimator falls at a particular x value to a corresponding
smoothing in the population, use the delta method to show that an estimated
asymptotic variance is

\[
\tilde{\pi}(x)\,[1 - \tilde{\pi}(x)]\,
\frac{\sum_i \phi^2[(x - x_i)/\lambda]}{\bigl\{\sum_i \phi[(x - x_i)/\lambda]\bigr\}^2}.
\]

Explain why this decreases as $\lambda$ increases, and explain the implication.

b. As $\lambda$ increases unboundedly, explain intuitively why $\tilde{\pi}(x)$ converges to $p = (\sum_i y_i)/n$
and this estimated asymptotic variance is approximately $p(1 - p)/n$.


7.34  Use  a  probabilistic  argument  to  prove  that  the  Bonferroni  method  works. 


7.35 In Table 7.6, explain why (a) P(n_12 > 0) is a familywise error rate (FWER), (b)
E(n_12) is a per-family error rate (PFER), and (c) E(n_12)/g is a per-comparison
error rate (PCER).


7.36  Refer  to  the  previous  exercise. 

a. Explain why multiple testing procedures satisfy PCER ≤ FWER ≤ PFER. Explain
why  for  a  fixed  level  for  type  I  error  rates,  a  procedure  that  controls  the  PFER  is 
most  conservative,  leading  to  the  fewest  rejections  of  null  hypotheses. 

b.  If  hypothesis  Hj  is  tested  at  level  otj,  j  =  1 , . . . ,  g,  then  under  the  complete  null 
hypothesis,  explain  why  (Dudoit  et  al.  2003) 

\  .  Ql  ■  _ ^ 

PCER  =  — - — -  <  max(o!| , ....  otf,)  <  FWER  <  PFER  =  otj. 

8  J 

7.37  Read  one  of  the  genomics  papers  cited  in  Section  7.5.5  and  prepare  a  two-page 
report  summarizing  its  main  contributions. 


CHAPTER  8 


Models  for  Multinomial  Responses 


In  Chapters  5,  6,  and  7,  we  modeled  binary  response  variables  with  binomial  GLMs. 
Multicategory  responses  use  multinomial  GLMs.  In  this  chapter  we  generalize  logistic 
regression  to  handle  multinomial  response  variables,  with  separate  models  for  nominal  and 
ordinal  cases. 

In  Section  8.1  we  present  a  model  for  nominal  responses.  It  uses  a  separate  binary 
logistic  equation  for  each  pair  of  response  categories.  In  Section  8.2  we  present  a  model 
for  ordinal  responses,  using  logits  of  cumulative  response  probabilities.  In  Section  8.3  we 
use  other  link  functions  for  those  cumulative  probabilities  and  consider  alternative  ordinal 
logit  models. 

In Section 8.4 we present tests of conditional independence with multinomial responses
using models and using generalizations of the Cochran-Mantel-Haenszel statistic. In
Section 8.5 we introduce a multinomial logit model for discrete-choice modeling of a subject's
choice from one of several options when values of predictors may depend on the option.
The final section discusses Bayesian methods for multinomial response modeling.


8.1  NOMINAL  RESPONSES:  BASELINE-CATEGORY  LOGIT  MODELS 

For a nominal-scale response variable $Y$ with $J$ categories, multicategory (also called
polytomous) logistic models simultaneously describe the log odds for all pairs of
categories. Given a certain choice of $J - 1$ of these, the rest are redundant.

8.1.1  Baseline-Category  Logits 

Let $\pi_j(x) = P(Y = j \mid x)$ at a fixed setting $x$ for explanatory variables, with $\sum_j \pi_j(x) = 1$.
For observations at that setting, we treat the counts at the $J$ categories of $Y$ as a multinomial
variate with probabilities $\{\pi_1(x), \ldots, \pi_J(x)\}$. Logistic models pair each response category
with a baseline category, such as the last one or the most common one. Consider the model

\[ \log\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j^T x, \quad j = 1, \ldots, J - 1. \tag{8.1} \]




The left-hand side is the logit of a conditional probability, $\text{logit}[P(Y = j \mid Y = j \text{ or } Y = J)]$.
This model simultaneously describes the effects of $x$ on these $J - 1$ logits. The effects
vary according to the response paired with the baseline. These $J - 1$ equations determine
parameters for logits with other pairs of response categories, since

\[ \log\frac{\pi_a(x)}{\pi_b(x)} = \log\frac{\pi_a(x)}{\pi_J(x)} - \log\frac{\pi_b(x)}{\pi_J(x)}. \]


With categorical predictors, $X^2$ and $G^2$ goodness-of-fit statistics provide a model check
when data are not sparse. When an explanatory variable is continuous or the data are sparse,
such statistics are valid only for comparing nested models differing by relatively few terms.


8.1.2  Example:  Alligator  Food  Choice 

Table 8.1 is from a study of factors influencing the primary food choice of alligators.
The study captured 219 alligators in four Florida lakes. The nominal response variable
is the primary food type, in volume, found in an alligator's stomach. This had five
categories: fish, invertebrate, reptile, bird, other. The invertebrates included apple snails,
aquatic insects, and crayfish. The reptiles were primarily turtles, although one stomach
contained the tags of 23 baby alligators released in the lake the previous year! The
"other" category consisted of amphibian, mammal, plant material, stones or other
debris, or no food or dominant type. Table 8.1 also classifies the alligators according to
L = lake of capture (Hancock, Oklawaha, Trafford, George), G = gender (male, female),
and S = size (≤ 2.3 meters long, > 2.3 meters long).


Table 8.1  Primary Food Choice of Alligators, by Lake, Gender, and Size of the Alligator

                                          Primary Food Choice
Lake       Gender   Size (m)    Fish   Invertebrate   Reptile   Bird   Other

Hancock    Male     ≤ 2.3          7              1         0      0       5
                    > 2.3          4              0         0      1       2
           Female   ≤ 2.3         16              3         2      2       3
                    > 2.3          3              0         1      2       3
Oklawaha   Male     ≤ 2.3          2              2         0      0       1
                    > 2.3         13              7         6      0       0
           Female   ≤ 2.3          3              9         1      0       2
                    > 2.3          0              1         0      1       0
Trafford   Male     ≤ 2.3          3              7         1      0       1
                    > 2.3          8              6         6      3       5
           Female   ≤ 2.3          2              4         1      1       4
                    > 2.3          0              1         0      0       0
George     Male     ≤ 2.3         13             10         0      2       2
                    > 2.3          9              0         0      1       2
           Female   ≤ 2.3          3              9         1      0       1
                    > 2.3          8              1         0      0       1

Source: Data courtesy of Clint Moore, from an unpublished manuscript by M. F. Delaney and C. T. Moore.

Table 8.2  Goodness of Fit of Baseline-Category Logit Models for Table 8.1 on
Alligator Primary Food Choice

            Full table                      Collapsed over G
Model(a)        G^2      X^2    df     Model       G^2     X^2    df

( )           116.8    106.5    60     ( )        81.4    73.1    28
(G)           114.7    101.2    56
(S)           101.6     86.9    56     (S)        66.2    54.3    24
(L)            73.6     79.6    48     (L)        38.2    32.7    16
(L + S)        52.5     58.0    44     (L + S)    17.1    15.0    12
(G + L + S)    50.3     52.6    40

(a) G, gender; S, size; L, lake of capture.


Baseline-category logit models can investigate the effects of L, G, and S on primary
food type. Table 8.2 contains fit statistics for several models. We denote a model by its
predictors: for instance, (L + S) has additive lake and size effects, and ( ) has no predictors.
The data are sparse, 219 observations scattered among 80 cells. Thus, $G^2$ is more
reliable for comparing models than for testing a model's fit. The statistics $G^2[(\,) \mid (G)] = 2.1$
and $G^2[(L + S) \mid (G + L + S)] = 2.2$, each based on df = 4, suggest simplifying by
collapsing the table over gender. (Other analyses, not presented here, show that adding
interaction terms including G does not improve the fit significantly.) The $G^2$ and $X^2$
values for reduced models for the collapsed table indicate that both L and S have effects.
Table 8.3 exhibits fitted values for model (L + S) for the collapsed table. Absolute values
of standardized residuals comparing observed and fitted values exceed 2 in only two of the
40 cells and exceed 3 in none of the cells. The fit seems adequate.
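
The model comparisons quoted above can be reproduced from the Table 8.2 entries. The following is a minimal sketch, not part of the original analysis: the helper names are hypothetical, and the P-values use the closed-form chi-squared tail probability that holds only for even df (which covers the df = 4 comparisons here).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

# (G^2, df) for the full (uncollapsed) table, copied from Table 8.2
fit = {"()": (116.8, 60), "(G)": (114.7, 56), "(L+S)": (52.5, 44), "(G+L+S)": (50.3, 40)}

def lr_test(simpler, fuller):
    """Likelihood-ratio comparison of two nested models: (G^2 difference, df, P-value)."""
    g2_s, df_s = fit[simpler]
    g2_f, df_f = fit[fuller]
    diff, df = g2_s - g2_f, df_s - df_f
    return round(diff, 1), df, chi2_sf_even_df(diff, df)

gender_alone = lr_test("()", "(G)")              # effect of G with no other predictors
gender_given_ls = lr_test("(L+S)", "(G+L+S)")    # effect of G adjusting for L and S
print(gender_alone, gender_given_ls)
```

Both P-values are large, which is consistent with the decision to collapse the table over gender.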

Fish was the most common food choice. We now estimate the effects of lake and size
on the odds that alligators select other primary food types instead of fish. Let $s = 1$ for size

Table 8.3  Observed and Fitted Values for Baseline-Category Logit Model Using Lake and
Size of Alligator Main Effects to Predict Primary Food Choice

           Size of                       Primary Food Choice
           Alligator
Lake       (meters)      Fish    Invertebrate   Reptile    Bird    Other

Hancock    ≤ 2.3           23               4         2       2        8
                        (20.9)           (3.6)     (1.9)   (2.7)    (9.9)
           > 2.3            7               0         1       3        5
                         (9.1)           (0.4)     (1.1)   (2.3)    (3.1)
Oklawaha   ≤ 2.3            5              11         1       0        3
                         (5.2)          (12.0)     (1.5)   (0.2)    (1.1)
           > 2.3           13               8         6       1        0
                        (12.8)           (7.0)     (5.5)   (0.8)    (1.9)
Trafford   ≤ 2.3            5              11         2       1        5
                         (4.4)          (12.4)     (2.1)   (0.9)    (4.2)
           > 2.3            8               7         6       3        5
                         (8.6)           (5.6)     (5.9)   (3.1)    (5.8)
George     ≤ 2.3           16              19         1       2        3
                        (18.5)          (16.9)     (0.5)   (1.2)    (3.8)
           > 2.3           17               1         0       1        3
                        (14.5)           (3.1)     (0.5)   (1.8)    (2.2)

Note: fitted values in parentheses.

Table 8.4  Estimated Parameters in Baseline-Category Logit Model for
Alligator Food Choice, Based on Indicator Variable for Size
(1 = Small, 0 = Large) and for Each Lake except Lake George(a)

                                                          Lake
Logit(b)            Intercept   Size ≤ 2.3      Hancock        Oklawaha       Trafford

log(pi_I/pi_F)        -1.55     1.46 (0.40)   -1.66 (0.61)    0.94 (0.47)    1.12 (0.49)
log(pi_R/pi_F)        -3.31    -0.35 (0.58)    1.24 (1.19)    2.46 (1.12)    2.94 (1.12)
log(pi_B/pi_F)        -2.09    -0.63 (0.64)    0.70 (0.78)   -0.65 (1.20)    1.09 (0.84)
log(pi_O/pi_F)        -1.90     0.33 (0.45)    0.83 (0.56)    0.01 (0.78)    1.52 (0.62)

(a) SE values in parentheses.
(b) Response categories: I, invertebrate; R, reptile; B, bird; O, other; F, fish.


≤ 2.3 meters and 0 otherwise, let $z_H$ be an indicator variable for Lake Hancock ($z_H = 1$ for
alligators in that lake and 0 otherwise), and let $z_O$ and $z_T$ be indicator variables for Lakes
Oklawaha and Trafford. With fish as the baseline category, Table 8.4 contains ML estimates
of effect parameters. We use letter subscripts to denote the food choice categories. For
example, the prediction equation for the log odds of selecting invertebrates instead of fish is

\[ \log(\hat\pi_I/\hat\pi_F) = -1.55 + 1.46s - 1.66z_H + 0.94z_O + 1.12z_T. \]

Size  of  alligator  has  a  noticeable  effect.  For  a  given  lake,  for  small  alligators  the  estimated 
odds  that  primary  food  choice  was  invertebrates  instead  of  fish  are  exp(1.46)  =  4.3 
times  the  estimated  odds  for  large  alligators;  the  Wald  95%  confidence  interval  is 
exp[1.46±  1.96(0.396)]  =  (2.0,  9.3).  The  lake  effects  indicate  that  the  estimated  odds 
that  the  primary  food  choice  was  invertebrates  instead  of  fish  are  relatively  higher  at  Lakes 
Trafford  and  Oklawaha  and  relatively  lower  at  Lake  Hancock  than  they  are  at  Lake  George. 

The  equations  in  Table  8.4  determine  those  for  other  food-choice  pairs.  For  instance,  for 
the  pair  (invertebrate,  other), 

\[
\begin{aligned}
\log(\hat\pi_I/\hat\pi_O) &= \log(\hat\pi_I/\hat\pi_F) - \log(\hat\pi_O/\hat\pi_F) \\
&= (-1.55 + 1.46s - 1.66z_H + 0.94z_O + 1.12z_T) \\
&\quad - (-1.90 + 0.33s + 0.83z_H + 0.01z_O + 1.52z_T) \\
&= 0.35 + 1.13s - 2.48z_H + 0.93z_O - 0.39z_T.
\end{aligned}
\]

Viewing  all  these,  we  see  that  size  has  its  greatest  impact  in  terms  of  whether  invertebrates 
rather  than  fish  are  the  primary  food  choice. 
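
As an illustrative sketch (the dictionary layout and function name are assumptions, not anything from the study), the logit for any pair of food categories can be obtained by differencing two rows of Table 8.4:

```python
# Rows of Table 8.4: (intercept, size, Hancock, Oklawaha, Trafford); fish is the baseline
coef = {
    "I": (-1.55, 1.46, -1.66, 0.94, 1.12),   # invertebrate vs. fish
    "R": (-3.31, -0.35, 1.24, 2.46, 2.94),   # reptile vs. fish
    "B": (-2.09, -0.63, 0.70, -0.65, 1.09),  # bird vs. fish
    "O": (-1.90, 0.33, 0.83, 0.01, 1.52),    # other vs. fish
}

def logit_pair(a, b):
    """Coefficients of log(pi_a / pi_b), found by subtracting two baseline-category logits."""
    return tuple(round(x - y, 2) for x, y in zip(coef[a], coef[b]))

inv_vs_other = logit_pair("I", "O")
print(inv_vs_other)
```

Because this sketch starts from the two-decimal values printed in Table 8.4, the Hancock and Trafford coefficients can differ in the last digit from the values shown in the text, which presumably were computed from unrounded estimates.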

8.1.3  Estimating  Response  Probabilities 

The equation that expresses multinomial logistic models directly in terms of response
probabilities $\{\pi_j(x)\}$ is

\[ \pi_j(x) = \frac{\exp(\alpha_j + \beta_j^T x)}{1 + \sum_{h=1}^{J-1} \exp(\alpha_h + \beta_h^T x)} \tag{8.2} \]

with $\alpha_J = 0$ and $\beta_J = 0$. This follows from (8.1), noting that (8.1) also holds with $j = J$
by setting $\alpha_J = 0$ and $\beta_J = 0$. (The parameters also equal zero for a baseline category for
identifiability reasons; see Exercise 8.31.) The denominator of (8.2) is the same for each
$j$. The numerators for various $j$ sum to the denominator, so $\sum_j \pi_j(x) = 1$. For $J = 2$, this
formula simplifies to the binary logistic regression probability formula (5.1).

From Table 8.4 the estimated probability that a large alligator in Lake Hancock has
invertebrates as the primary food choice is

\[ \hat\pi_I = \frac{e^{-1.55-1.66}}{1 + e^{-1.55-1.66} + e^{-3.31+1.24} + e^{-2.09+0.70} + e^{-1.90+0.83}} = 0.023. \]

The estimated probabilities for (reptile, bird, other, fish) are (0.072, 0.141, 0.194, 0.570).
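
The calculation above can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the lake and size coding are hypothetical conveniences, and the coefficients are the rounded entries of Table 8.4, so results agree with the text only to about two decimal places.

```python
import math

# Rows of Table 8.4: (intercept, size, Hancock, Oklawaha, Trafford); fish is the baseline
coef = {
    "I": (-1.55, 1.46, -1.66, 0.94, 1.12),
    "R": (-3.31, -0.35, 1.24, 2.46, 2.94),
    "B": (-2.09, -0.63, 0.70, -0.65, 1.09),
    "O": (-1.90, 0.33, 0.83, 0.01, 1.52),
}
lake_indicators = {"Hancock": (1, 0, 0), "Oklawaha": (0, 1, 0),
                   "Trafford": (0, 0, 1), "George": (0, 0, 0)}

def food_probabilities(small, lake):
    """Estimated response probabilities from formula (8.2) for one size and lake setting."""
    x = (1, 1 if small else 0) + lake_indicators[lake]
    expo = {k: math.exp(sum(b * v for b, v in zip(beta, x))) for k, beta in coef.items()}
    denom = 1.0 + sum(expo.values())      # the 1 is exp(0) for the baseline category
    probs = {k: e / denom for k, e in expo.items()}
    probs["F"] = 1.0 / denom
    return probs

p = food_probabilities(small=False, lake="Hancock")
print({k: round(v, 3) for k, v in p.items()})
```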

This  example  used  qualitative  predictors.  Multinomial  logit  models  can  also  contain 
quantitative  predictors.  In  this  study,  the  biologists  used  the  size  indicator  variable  to 
distinguish  between  adult  and  subadult  alligators.  However,  the  alligators’  actual  length 
was  measured  and  is  quantitative.  With  quantitative  predictors,  it  is  informative  to  plot  the 
estimated probabilities. To illustrate, for alligators at one lake, Figure 8.1 plots the estimated
probabilities  that  primary  food  choice  is  fish,  invertebrate,  or  other  (which  combines  the 
other,  bird,  and  reptile  categories)  as  a  function  of  length.  With  more  than  two  response 
categories,  the  probability  for  a  given  category  need  not  continuously  increase  or  decrease 
(Exercise  8.32). 


8.1.4  Fitting  Baseline-Category  Logistic  Models 

ML fitting of multinomial logistic models maximizes the likelihood subject to $\{\pi_j(x)\}$
simultaneously satisfying the $J - 1$ equations that specify the model. Let $y_i = (y_{i1}, \ldots, y_{iJ})$
represent the multinomial trial for subject $i$, where $y_{ij} = 1$ when the response is in
category $j$ and $y_{ij} = 0$ otherwise, so $\sum_j y_{ij} = 1$. Let $x_i = (x_{i1}, \ldots, x_{ip})^T$ denote
explanatory variable values for subject $i$. Let $\beta_j = (\beta_{j1}, \ldots, \beta_{jp})^T$ denote parameters for the $j$th
baseline-category logit.

Since $\pi_J = 1 - (\pi_1 + \cdots + \pi_{J-1})$ and $y_{iJ} = 1 - (y_{i1} + \cdots + y_{i,J-1})$, the contribution
to the log likelihood by subject $i$ is

\[
\begin{aligned}
\log\Bigl[\prod_{j=1}^{J} \pi_j(x_i)^{y_{ij}}\Bigr]
&= \sum_{j=1}^{J-1} y_{ij} \log \pi_j(x_i) + \Bigl(1 - \sum_{j=1}^{J-1} y_{ij}\Bigr) \log\Bigl[1 - \sum_{j=1}^{J-1} \pi_j(x_i)\Bigr] \\
&= \sum_{j=1}^{J-1} y_{ij} \log \frac{\pi_j(x_i)}{1 - \sum_{h=1}^{J-1} \pi_h(x_i)} + \log\Bigl[1 - \sum_{j=1}^{J-1} \pi_j(x_i)\Bigr].
\end{aligned}
\]

Thus, the baseline-category logits are the natural parameters for the multinomial
distribution.

Next, we construct the likelihood equations, for $n$ independent observations. In the last
expression above, we substitute $\alpha_j + \beta_j^T x_i$ for the logit in the first term and

\[ \pi_J(x_i) = \frac{1}{1 + \sum_{j=1}^{J-1} \exp(\alpha_j + \beta_j^T x_i)} \]

in the second term. Then, the log likelihood is

\[
\begin{aligned}
\log \prod_{i=1}^{n} \Bigl[\prod_{j=1}^{J} \pi_j(x_i)^{y_{ij}}\Bigr]
&= \sum_{i=1}^{n} \Bigl\{ \sum_{j=1}^{J-1} y_{ij}(\alpha_j + \beta_j^T x_i) - \log\Bigl[1 + \sum_{j=1}^{J-1} \exp(\alpha_j + \beta_j^T x_i)\Bigr] \Bigr\} \\
&= \sum_{j=1}^{J-1} \Bigl[ \alpha_j \Bigl(\sum_{i=1}^{n} y_{ij}\Bigr) + \sum_{k=1}^{p} \beta_{jk} \Bigl(\sum_{i=1}^{n} x_{ik} y_{ij}\Bigr) \Bigr]
 - \sum_{i=1}^{n} \log\Bigl[1 + \sum_{j=1}^{J-1} \exp(\alpha_j + \beta_j^T x_i)\Bigr].
\end{aligned}
\]

The sufficient statistic for $\beta_{jk}$ is $\sum_i x_{ik} y_{ij}$, $j = 1, \ldots, J - 1$, $k = 1, \ldots, p$. The sufficient
statistic for $\alpha_j$ is $\sum_i y_{ij} = \sum_i x_{i0} y_{ij}$ for $x_{i0} = 1$; this is the total number of outcomes in
category $j$.

The  likelihood  equations  equate  the  sufficient  statistics  to  their  expected  values.  The  log- 
likelihood  function  is  concave,  and  the  Newton-Raphson  method  yields  the  ML  parameter 
estimates.  The  exception  is  when  there  is  a  choice  of  baseline  category  such  that  complete 
or  quasi-complete  separation  occurs  for  each  logit  when  paired  with  another  category.  In 
that  case,  some  estimates  and  SE  values  are  actually  infinite. 

The estimators have large-sample normal distributions. As usual, standard errors are
square roots of diagonal elements of the inverse information matrix.
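
The statement that the likelihood equations equate sufficient statistics to their expected values can be checked numerically against Table 8.3. The sketch below is illustrative only: the indicator coding in `design_row` is an assumption chosen to match the description of Table 8.4, and since the fitted values in Table 8.3 are rounded to one decimal, agreement is expected only up to rounding error.

```python
# (observed, fitted) counts from Table 8.3; columns: fish, invertebrate, reptile, bird, other
table_8_3 = {
    ("Hancock", "small"): ([23, 4, 2, 2, 8], [20.9, 3.6, 1.9, 2.7, 9.9]),
    ("Hancock", "large"): ([7, 0, 1, 3, 5], [9.1, 0.4, 1.1, 2.3, 3.1]),
    ("Oklawaha", "small"): ([5, 11, 1, 0, 3], [5.2, 12.0, 1.5, 0.2, 1.1]),
    ("Oklawaha", "large"): ([13, 8, 6, 1, 0], [12.8, 7.0, 5.5, 0.8, 1.9]),
    ("Trafford", "small"): ([5, 11, 2, 1, 5], [4.4, 12.4, 2.1, 0.9, 4.2]),
    ("Trafford", "large"): ([8, 7, 6, 3, 5], [8.6, 5.6, 5.9, 3.1, 5.8]),
    ("George", "small"): ([16, 19, 1, 2, 3], [18.5, 16.9, 0.5, 1.2, 3.8]),
    ("George", "large"): ([17, 1, 0, 1, 3], [14.5, 3.1, 0.5, 1.8, 2.2]),
}

def design_row(lake, size):
    """Assumed coding: intercept, small-size indicator, then Hancock/Oklawaha/Trafford indicators."""
    return [1, int(size == "small"), int(lake == "Hancock"),
            int(lake == "Oklawaha"), int(lake == "Trafford")]

def max_gap():
    """Largest |sufficient statistic - fitted expected value| over all (category, predictor) pairs."""
    worst = 0.0
    for j in range(1, 5):              # non-baseline categories; index 0 (fish) is the baseline
        for k in range(5):             # intercept, size, and the three lake indicators
            observed = sum(design_row(*key)[k] * obs[j] for key, (obs, fit) in table_8_3.items())
            expected = sum(design_row(*key)[k] * fit[j] for key, (obs, fit) in table_8_3.items())
            worst = max(worst, abs(observed - expected))
    return worst

gap = max_gap()
print(gap)
```

A gap no larger than the rounding error in the table is what the likelihood equations predict for the ML fit of model (L + S).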

8.1.5  Multicategory  Logit  Model  as  a  Multivariate  GLM 

For a univariate response variable in the natural exponential family, a GLM has form
$g(\mu_i) = x_i^T \beta$ for a link function $g$, expected response $\mu_i = E(Y_i)$, vector of values $x_i$ of
$p$ explanatory variables for observation $i$, and parameter vector $\beta = (\beta_1, \ldots, \beta_p)^T$. This
extends to a multivariate generalized linear model for distributions in the multivariate
exponential family (Exercise 8.29), such as the multinomial.

For response vector $y_i$ for subject $i$, with $\mu_i = E(Y_i)$, let $g$ be a vector of link functions.
The multivariate GLM has the form

\[ g(\mu_i) = X_i \beta, \tag{8.3} \]

where row $h$ of the model matrix $X_i$ for observation $i$ contains values of explanatory
variables for $y_{ih}$ (Fahrmeir and Tutz 2001, Chap. 3).

The baseline-category logit model is a multivariate GLM. Let $y_i = (y_{i1}, \ldots, y_{i,J-1})^T$,
since $y_{iJ}$ is redundant, $\mu_i = (\pi_1(x_i), \ldots, \pi_{J-1}(x_i))^T$ and

\[ g_j(\mu_i) = \log\{\mu_{ij}/[1 - (\mu_{i1} + \cdots + \mu_{i,J-1})]\}. \]

The model matrix for observation $i$ is

\[
X_i = \begin{pmatrix}
1 & x_i^T & & & & & \\
 & & 1 & x_i^T & & & \\
 & & & & \ddots & & \\
 & & & & & 1 & x_i^T
\end{pmatrix}
\]

with 0 entries in other locations, and $\beta^T = (\alpha_1, \beta_1^T, \ldots, \alpha_{J-1}, \beta_{J-1}^T)$.

8.1.6  Multinomial  Probit  Models 

The multinomial logit model with baseline-category logits results from a latent utility
representation that generalizes the one mentioned in Section 7.1.1. Let $U_{ij}$ denote the
utility of response outcome $j$ for subject $i$. Suppose that

\[ U_{ij} = \alpha_j + \beta_j^T x_i + \epsilon_{ij}. \]

The response outcome for subject $i$ is the value of $j$ having maximum utility. McFadden
(1974) showed that the assumption that $\{\epsilon_{ij}\}$ are independent and have the extreme value
distribution (i.e., cdf $F(\epsilon) = \exp[-\exp(-\epsilon)]$) is equivalent to multinomial logit model
(8.1) holding. The identifiable parameters for that model are $\{\beta_j - \beta_J\}$. Likewise, the
utilities are identifiable in terms of relative utilities $\{U_{ij} - U_{iJ}\}$.
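
A small simulation can make the equivalence between the extreme value utility model and the multinomial logit model concrete. This is only a sketch with hypothetical linear predictor values; it draws extreme value errors by inverting the cdf stated above and compares the simulated choice shares with the probabilities from formula (8.2).

```python
import math
import random

def extreme_value_draw(rng):
    """Draw from the cdf F(e) = exp(-exp(-e)) by inversion of a uniform variate."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_shares(linear_predictors, n, seed=0):
    """Share of n simulated subjects choosing each outcome by maximum utility."""
    rng = random.Random(seed)
    counts = [0] * len(linear_predictors)
    for _ in range(n):
        utilities = [eta + extreme_value_draw(rng) for eta in linear_predictors]
        counts[utilities.index(max(utilities))] += 1
    return [c / n for c in counts]

def logit_probabilities(linear_predictors):
    """Multinomial logit probabilities; a baseline category has linear predictor 0."""
    expo = [math.exp(eta) for eta in linear_predictors]
    total = sum(expo)
    return [e / total for e in expo]

eta = [0.5, -0.2, 0.0]                   # hypothetical alpha_j + beta_j' x; last is the baseline
shares = simulate_choice_shares(eta, n=20000)
probs = logit_probabilities(eta)
print([round(s, 3) for s in shares], [round(p, 3) for p in probs])
```

With a sample of this size the simulated shares should agree with the logit probabilities to within simulation error.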

It may seem more natural to assume that $\{\epsilon_{ij}\}$ have a normal distribution. Aitchison
and Bennett (1970) suggested this approach, for independent standard normal variates.
The corresponding model, called the multinomial probit model, gives a similar fit. For a
particular explanatory variable $x_k$ and pair of categories $a$ and $b$, $(\beta_{ak} - \beta_{bk})$ describes
the effect of a 1-unit increase in $x_k$ on the difference between the mean utilities for those
categories. If the normal distribution for $\{\epsilon_{ij}\}$ had instead been scaled to have some fixed
standard deviation $\sigma$, then $(\beta_{ak} - \beta_{bk})$ would describe the difference in mean utilities in
terms of the number of standard deviations of the utility distribution.

Fitting the multinomial probit model is computationally more complex than the
corresponding logit model. Finding the likelihood function requires numerical integration,
because

\[
\begin{aligned}
\pi_j(x_i) &= P(U_{ij} > U_{ik} \text{ for all } k \ne j) = E_{U_{ij}}[P(U_{ik} < u_{ij} \text{ for all } k \ne j \mid U_{ij} = u_{ij})] \\
&= \int \phi(u_{ij} - \alpha_j - \beta_j^T x_i) \prod_{k \ne j} \Phi(u_{ij} - \alpha_k - \beta_k^T x_i) \, du_{ij},
\end{aligned}
\]

for the standard normal pdf $\phi$ and cdf $\Phi$.

It often seems unrealistic to expect the errors for different outcomes in the utility
latent model to be uncorrelated. A more general model permits an arbitrary covariance
matrix for $(\epsilon_{i1}, \ldots, \epsilon_{iJ})$, with $\text{var}(\epsilon_{i1}) = 1$ for identifiability. Fitting is then even more
complex. Natarajan et al. (2000) proposed a Monte Carlo EM algorithm for ML
estimation that has the advantage of circumventing direct evaluation of the likelihood function
by taking advantage of the latent structure. See also Imai and van Dyk (2005) and
McCulloch et al. (2000), who utilized a corresponding latent variable model introduced in
Section 8.6.3.


8.1.7  Example:  Effect  of  Menu  Pricing 

Natarajan  et  al.  (2000)  described  a  study  to  investigate  the  effect  of  the  pricing  of  a  fish 
dish  in  a  restaurant  on  a  customer’s  choice  among  four  popular  food  choices.  On  several 
winter  Fridays  or  Saturdays  the  fish  dish  was  priced  between  $8.95  and  $10.95.  Data  were 
collected  for  974  orders.  Treating  the  fish  dish  as  the  baseline  category,  the  multinomial 
probit  model  provided  three  equations  for  the  difference  between  the  predicted  utility  for 
each  food  item  and  fish. 

For example, the equation relating steak (the first item) to the fish dish had predicted
utility difference for subject $i$ of

\[ \hat U_{i1} - \hat U_{i4} = 0.168 - 0.502F_i - 0.072P_i, \]

where $F_i = 1$ for Friday and 0 for Saturday, and $P_i$ is the price of the fish item when subject
$i$ ordered. The standard errors were 0.178 for the Friday effect and 0.072 for the fish pricing
effect. So, the fish pricing did not have a significant effect on the choice between fish and
steak (higher price even having a negative estimated effect on selecting steak). Natarajan
et al. (2000) used a general covariance structure for the normal errors, with $\text{var}(U_{i1} - U_{i4}) = 1.0$
for identifiability. Thus, the estimated effect of Friday was to depress the utility for
steak relative to fish by half a standard deviation of the normal distribution for the utility
difference.


8.2  ORDINAL  RESPONSES:  CUMULATIVE  LOGIT  MODELS

We  have  discussed  the  benefits  of  utilizing  the  ordinality  of  a  variable  by  focusing  inferences 
on  a  single  parameter  (e.g.,  see  Section  5.3.7).  These  benefits  extend  to  models  for  ordinal 
responses.  Models  with  terms  that  reflect  ordinal  characteristics  such  as  monotone  trend 
have  improved  model  parsimony  and  power.  In  this  section  we  introduce  the  most  popular 
logistic  model  for  ordinal  responses. 


8.2.1  Cumulative  Logits 

We utilize the category ordering by forming logits of cumulative probabilities,

\[ P(Y \le j \mid x) = \pi_1(x) + \cdots + \pi_j(x), \quad j = 1, \ldots, J. \]

The cumulative logits are defined as

\[
\begin{aligned}
\text{logit}[P(Y \le j \mid x)] &= \log\frac{P(Y \le j \mid x)}{1 - P(Y \le j \mid x)} \\
&= \log\frac{\pi_1(x) + \cdots + \pi_j(x)}{\pi_{j+1}(x) + \cdots + \pi_J(x)}, \quad j = 1, \ldots, J - 1.
\end{aligned}
\tag{8.4}
\]

Each cumulative logit uses all $J$ response categories.
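
As a brief illustration of definition (8.4), the sketch below computes the $J - 1$ cumulative logits from a vector of category probabilities. The probabilities are hypothetical and the function name is an assumption made for this example.

```python
import math

def cumulative_logits(probs):
    """Return the J - 1 cumulative logits of (8.4) for ordered category probabilities."""
    logits, cum = [], 0.0
    for p in probs[:-1]:                 # the Jth cumulative probability equals 1, so it has no logit
        cum += p
        logits.append(math.log(cum / (1.0 - cum)))
    return logits

pi = [0.2, 0.5, 0.3]                     # hypothetical probabilities for J = 3 ordered categories
logits = cumulative_logits(pi)
print([round(v, 3) for v in logits])
```

Because the cumulative probabilities increase in $j$, the resulting logits are necessarily increasing as well.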


8.2.2  Proportional  Odds  Form  of  Cumulative  Logit  Model 

A model for $\text{logit}[P(Y \le j)]$ alone is an ordinary logistic model for a binary response
in which categories 1 to $j$ form one outcome and categories $j + 1$ to $J$ form the second.
A model that simultaneously uses all $(J - 1)$ cumulative logits in a single parsimonious
model is

\[ \text{logit}[P(Y \le j \mid x)] = \alpha_j + \beta^T x, \quad j = 1, \ldots, J - 1. \tag{8.5} \]

Each cumulative logit has its own intercept. The $\{\alpha_j\}$ are increasing in $j$, because
$P(Y \le j \mid x)$ increases in $j$ for fixed $x$ and the logit is an increasing function of $P(Y \le j \mid x)$.

This model assumes the same effects $\beta$ for each logit. For a single continuous predictor
$x$, Figure 8.2 depicts the model when $J = 4$. For fixed $j$, the response curve is a logistic
regression curve for a binary response with outcomes $(Y \le j)$ and $(Y > j)$. The curves for
$j = 1$, 2, and 3 have the same shape. They share exactly the same rate of increase or decrease
but are horizontally displaced from each other. Figure 8.3 portrays the corresponding curves
for the category probabilities.

Figure 8.2  Cumulative logit model with the same effect on each of three cumulative probabilities in a
four-category response.

Figure 8.3  Individual category probabilities in cumulative logit model with four response categories.

The cumulative logit model (8.5) satisfies

\[
\text{logit}[P(Y \le j \mid x_1)] - \text{logit}[P(Y \le j \mid x_2)]
= \log\frac{P(Y \le j \mid x_1)/P(Y > j \mid x_1)}{P(Y \le j \mid x_2)/P(Y > j \mid x_2)}
= \beta^T(x_1 - x_2).
\]

An odds ratio of cumulative probabilities is called a cumulative odds ratio. The odds
of making response $\le j$ at $x = x_1$ are $\exp[\beta^T(x_1 - x_2)]$ times the odds at $x = x_2$. The
log cumulative odds ratio is proportional to the distance between $x_1$ and $x_2$. The same
proportionality constant applies to each logit. Because of this property, model (8.5) is often
called a proportional odds model (McCullagh 1980).

With a single predictor, the cumulative odds ratio equals $e^{\beta}$ whenever $x_1 - x_2 = 1$.
Figure 8.4 illustrates the constant cumulative odds ratio this model then implies for all $j$.
It shows the $J$-category response collapsed into the binary outcome $(\le j, > j)$ and shows
the sets of cells that determine the cumulative odds ratio that takes the same value $e^{\beta}$ for
each such collapsing.
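
The proportional odds property can be verified numerically. The following sketch uses hypothetical intercepts and a hypothetical effect for a single predictor with $J = 4$; it computes cumulative and category probabilities under model (8.5) and the cumulative odds ratio for a 1-unit difference in $x$.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_probs(alphas, beta, x):
    """P(Y <= j | x), j = 1, ..., J - 1, under model (8.5) with one predictor."""
    return [expit(a + beta * x) for a in alphas]

def category_probs(alphas, beta, x):
    """P(Y = j | x), j = 1, ..., J, as differences of adjacent cumulative probabilities."""
    cum = [0.0] + cumulative_probs(alphas, beta, x) + [1.0]
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def cumulative_odds_ratios(alphas, beta, x1, x2):
    """Odds of (Y <= j) at x1 divided by the odds at x2, for each j."""
    def odds(p):
        return p / (1.0 - p)
    return [odds(p1) / odds(p2)
            for p1, p2 in zip(cumulative_probs(alphas, beta, x1),
                              cumulative_probs(alphas, beta, x2))]

alphas, beta = [-1.0, 0.5, 2.0], 0.8     # hypothetical increasing intercepts and common effect
ratios = cumulative_odds_ratios(alphas, beta, x1=3.0, x2=2.0)
cats = category_probs(alphas, beta, x=2.0)
print([round(r, 4) for r in ratios], round(math.exp(beta), 4))
```

Every element of `ratios` equals `exp(beta)` up to floating-point error, whichever of the three collapsings is used.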

Figure 8.4  Uniform odds ratios AD/BC whenever $x_1 - x_2 = 1$, for all binary collapsings of the response in
cumulative logit model of proportional odds form.

Model (8.5) constrains the $J - 1$ response curves to have the same shape. For
multicategory indicator $(y_{i1}, \ldots, y_{iJ})$ of the response for subject $i$, the product multinomial
likelihood function is

\[
\prod_{i=1}^{n} \prod_{j=1}^{J}
\left[ \frac{\exp(\alpha_j + \beta^T x_i)}{1 + \exp(\alpha_j + \beta^T x_i)}
- \frac{\exp(\alpha_{j-1} + \beta^T x_i)}{1 + \exp(\alpha_{j-1} + \beta^T x_i)} \right]^{y_{ij}},
\tag{8.6}
\]

viewed as a function of $(\{\alpha_j\}, \beta)$. This can be maximized to obtain the ML estimates
using Fisher scoring (McCullagh 1980, Walker and Duncan 1967) or the Newton-Raphson
method. The SE values differ somewhat, as the expected information and observed
information matrices are not the same for this non-canonical-link model.


8.2.3  Latent  Variable  Motivation  for  Proportional  Odds  Structure 

A regression model for a latent continuous variable assumed to underlie $Y$ motivates the
common effect $\beta$ for different $j$ in the proportional odds form of the model (Anderson
and Philips 1981). Let $Y^*$ denote this underlying latent variable. Suppose that it has cdf
$G(y^* - \eta)$, where values of $y^*$ vary around a location parameter $\eta$ (such as a mean) that
depends on $x$ through $\eta(x) = \beta^T x$. Suppose that the thresholds
$-\infty = \alpha_0 < \alpha_1 < \cdots < \alpha_J = \infty$ are cutpoints of the continuous scale such that the observed
response $y$ satisfies

\[ y = j \quad \text{if} \quad \alpha_{j-1} < y^* \le \alpha_j. \]

That is, $y$ falls in category $j$ when the latent variable falls in the $j$th interval of values, as
Figure 8.5 depicts. Then

\[ P(Y \le j \mid x) = P(Y^* \le \alpha_j \mid x) = G(\alpha_j - \beta^T x). \]

The appropriate model for $Y$ implies that the link function $G^{-1}$, the inverse of the cdf for
$Y^*$, applies to $P(Y \le j \mid x)$. If $Y^* = \beta^T x + \epsilon$, where the cdf $G$ of $\epsilon$ is the standard logistic
(Section 4.2.5), then $G^{-1}$ is the logit link and a proportional odds model results. Normality
for $\epsilon$ implies a probit link for cumulative probabilities (Section 8.3.2).
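
The latent variable construction lends itself to a quick simulation check. The sketch below uses hypothetical cutpoints, effect, and predictor value; it generates $Y^* = \beta x + \epsilon$ with standard logistic $\epsilon$, cuts the latent scale at the thresholds, and compares the simulated cumulative proportions with $G(\alpha_j - \beta x)$.

```python
import math
import random

def simulate_ordinal_shares(alphas, beta, x, n, seed=1):
    """Share of simulated responses in each category when Y* = beta*x + logistic error."""
    rng = random.Random(seed)
    counts = [0] * (len(alphas) + 1)
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                                  # avoid log(0)
            u = rng.random()
        y_star = beta * x + math.log(u / (1.0 - u))      # logistic draw by inverting its cdf
        counts[sum(y_star > a for a in alphas)] += 1     # index = number of cutpoints below y*
    return [c / n for c in counts]

def model_cumulative_probs(alphas, beta, x):
    """P(Y <= j | x) = G(alpha_j - beta*x), with G the standard logistic cdf."""
    return [1.0 / (1.0 + math.exp(-(a - beta * x))) for a in alphas]

alphas, beta, x = [-1.0, 0.5, 2.0], 0.8, 1.5             # hypothetical cutpoints, effect, setting
shares = simulate_ordinal_shares(alphas, beta, x, n=20000)
simulated_cum = [sum(shares[: j + 1]) for j in range(len(alphas))]
model_cum = model_cumulative_probs(alphas, beta, x)
print([round(v, 3) for v in simulated_cum], [round(v, 3) for v in model_cum])
```

Note that this sketch uses the linear predictor $\alpha_j - \beta x$ implied by the latent variable derivation, not the $\alpha_j + \beta x$ form of (8.5).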

Figure 8.5  Ordinal measurement and underlying regression model for a latent variable.

In this derivation, the same parameters $\beta$ occur for the effects regardless of how the
cutpoints $\{\alpha_j\}$ chop up the scale for the latent variable. The effect parameters are invariant
to the choice of categories for $Y$. If a continuous variable measuring political ideology has a
linear regression with some predictor variables, then the same effect parameters apply to a
discrete version of political ideology with the categories (liberal, moderate, conservative) or
(very liberal, slightly liberal, moderate, slightly conservative, very conservative). This feature
makes it possible to compare estimates from studies using different response scales.

Using a cdf of form $G(y^* - \eta)$ for the latent variable resulted in linear predictor
$\alpha_j - \beta^T x$ rather than $\alpha_j + \beta^T x$. When $\beta_k > 0$, as $x_k$ increases each cumulative logit then
decreases, so each cumulative probability decreases and relatively less probability mass
falls at the low end of the $Y$ scale. Thus, $Y$ tends to be larger at higher values of $x_k$. With
this parameterization the sign of $\beta_k$ has the usual meaning. However, some software (e.g.,
SAS) uses form (8.5).


8.2.4  Example:  Happiness  and  Traumatic  Events 

Table 8.5 shows GSS data on $Y$ = happiness (categories 1 = very happy, 2 = pretty happy,
3 = not too happy), $x_1$ = total number of traumatic events that happened to the respondent
and his/her relatives in the last year, and $x_2$ = race (1 = black, 0 = white). We restricted
the age range to 18-22 in order to have a relatively small sample ($n = 97$), to illustrate
how certain models may then have infinite ML estimates. In particular, only 13 of the 97
observations were in the black category of race, and of them, none had response in the very
happy category. Table 8.5 shows the data for four of the subjects. The complete data set is
at the text website.


Table 8.5  Four Observations from Data Set on Happiness, Number of Traumatic Events,
and Race

                                  Number of
Observation    Happiness          Traumatic Events    Race

1              Pretty happy       2                   White
2              Pretty happy       3                   Black
3              Very happy         0                   White
4              Not too happy      5                   White

Source: 1984 General Social Survey; complete data at www.stat.ufl.edu/~aa/cda/cda.html.

The main-effects cumulative logit model of proportional odds form (8.5) is

\[ \text{logit}[P(Y \le j \mid x)] = \alpha_j + \beta_1 x_1 + \beta_2 x_2. \]

Table 8.6 shows output. With $J = 3$ response categories, the model has two $\{\alpha_j\}$ intercepts.
Usually, these are not of interest except for computing response probabilities. The parameter
estimates yield estimated logits and hence estimates of $P(Y \le j)$, $P(Y > j)$, or $P(Y = j)$.
We illustrate for white subjects ($x_2 = 0$) at the mean number of traumatic events score of
$x_1 = 1.536$. Since $\hat\alpha_1 = -0.518$, the estimated probability of response very happy is

\[ \hat P(Y = 1) = \hat P(Y \le 1) = \frac{\exp[-0.518 - 0.406(1.536)]}{1 + \exp[-0.518 - 0.406(1.536)]} = 0.24. \]

Figure 8.6 plots $\hat P(Y \le 2)$ as a function of the number of traumatic events, at the two levels
of race. An alternative way to portray the model is to plot the parallel straight lines for the
fit in terms of the underlying latent variable.

The effect estimates $\hat\beta_1 = -0.406$ and $\hat\beta_2 = -2.036$ suggest that the cumulative
probability starting at the very happy end of the happiness scale decreases as the traumatic events
score increases and is lower for blacks than for whites. For example, given the traumatic
events score, for whites the estimated odds of reporting being very happy were $e^{2.036} = 7.7$
times the estimated odds for blacks. This estimate is imprecise, because relatively few
observations were in the black category. The 95% profile likelihood confidence interval
for $-\beta_2$ is (0.72, 3.43), corresponding to (2.05, 30.84) for the odds ratio effect. The SE
values reported are based on the expected information from Fisher scoring. Using observed
information (from Newton-Raphson), $\hat\beta_1$ and $\hat\beta_2$ have SE values of 0.183 and 0.686 instead
of 0.181 and 0.691.

Descriptions of effects can compare cumulative probabilities rather than use odds ratios.
These can make it easier to conceptualize the sizes of effects. We describe effects
of  quantitative  variables  by  comparing  probabilities  at  their  extreme  values  or  at  their 


Table 8.6 Software Output (Based on SAS) for Fitting Cumulative Logit Model to
Data on Happiness

Score Test for the Proportional Odds Assumption

Chi-Square    DF    Pr > ChiSq
    0.8668     2        0.6483

                          Std      Like. Ratio 95%       Chi-
Parameter    Estimate    Error       Conf Limits        Square    Pr > ChiSq
Intercept1    -0.5181   0.3382    -1.2020    0.1392       2.35        0.1255
Intercept2     3.4006   0.5648     2.3779    4.6266      36.25        <.0001
traumatic     -0.4056   0.1809    -0.7729   -0.0520       5.03        0.0249
race          -2.0361   0.6911    -3.4287   -0.7156       8.68        0.0032

Figure 8.6 Estimated values of $P(Y \le 2)$ by $x_1$ = number of traumatic events and $x_2$ = race. (Plot title: Predicted Probabilities for happiness = 1 or 2.)


quartiles. We describe effects of qualitative variables by comparing probabilities for different
categories. We fix values of quantitative variables by setting them at their mean or
median.  For  qualitative  variables  we  fix  the  category,  unless  there  are  several,  in  which  case 
we  can  set  each  at  their  indicator  means. 

We illustrate again with $P(Y = 1)$, the very happy outcome. First, we describe the race
effect. At the mean number of traumatic events of 1.536, $\hat{P}(Y = 1) = 0.04$ for blacks (i.e.,
$x_2 = 1$) and 0.24 for whites ($x_2 = 0$). Next, we describe the number of traumatic events
effect. The minimum and maximum values were 0 and 5. For blacks, $\hat{P}(Y = 1)$ changes
from  0.07  to  0.01  between  these  values;  for  whites,  it  changes  from  0.37  to  0.07.  (Note  that 
comparing  0.07  to  0.37  at  the  minimum  and  0.01  to  0.07  at  the  maximum  provides  further 
information  about  the  race  effect.)  The  sample  effect  is  substantial  for  both  predictors. 
However,  these  summaries  are  highly  tentative  and  have  large  standard  errors,  because  the 
black  sample  had  only  13  observations,  of  whom  none  reported  more  than  3  traumatic 
events. 
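
The probability summaries above can be reproduced directly from the estimates in Table 8.6. The following is a minimal sketch, assuming the coding $x_2 = 1$ for blacks and $x_2 = 0$ for whites used in the text; the function names `expit` and `category_probs` are illustrative choices, not part of any software output shown here.

```python
import math

# Estimates copied from Table 8.6 (cumulative logit model, proportional odds form).
ALPHA = (-0.5181, 3.4006)   # intercepts for logit P(Y <= 1) and logit P(Y <= 2)
BETA_TRAUMA = -0.4056       # effect of x1 = number of traumatic events
BETA_RACE = -2.0361         # effect of x2 = race (1 = black, 0 = white)


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probs(x1, x2):
    """Return [P(Y=1), P(Y=2), P(Y=3)] implied by the fitted model."""
    eta = BETA_TRAUMA * x1 + BETA_RACE * x2
    cum = [expit(a + eta) for a in ALPHA] + [1.0]   # P(Y <= 1), P(Y <= 2), P(Y <= 3)
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


for race, label in ((0, "white"), (1, "black")):
    for x1 in (0, 1.536, 5):
        probs = category_probs(x1, race)
        print(label, x1, [round(p, 2) for p in probs])
```

The first entry of each printed list is the estimated probability of the very happy outcome, so the six lines correspond to the comparisons at the minimum, mean, and maximum number of traumatic events for each race.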


8.2.5  Checking  the  Proportional  Odds  Assumption 

Models  in  this  section  used  the  proportional  odds  assumption  of  the  same  effects  for 
different  cumulative  logits.  An  advantage  is  that  effects  are  simple  to  summarize,  requiring 
only  a  single  parameter  for  each  predictor.  The  models  generalize  to  include  separate 
effects, replacing $\boldsymbol{\beta}$ in (8.5) by $\boldsymbol{\beta}_j$. This implies nonparallelism of curves for different
logits.  However,  curves  for  different  cumulative  probabilities  may  then  cross  for  some 
x  values.  Such  models  then  violate  the  proper  order  among  the  cumulative  probabilities 
(Exercise  8.37). 

Even  if  such  a  model  fits  better  over  the  observed  range  of  x,  for  reasons  of  parsimony  the 
simpler model might be preferable. One case is when effects $\{\boldsymbol{\beta}_j\}$ with different logits are
not  substantially  different  in  practical  terms.  Then  the  significance  in  a  test  of  proportional 
odds  may  reflect  primarily  a  large  value  of  n.  Even  with  smaller  n,  although  effect  estimators 
using the simpler model are biased, they may have smaller MSE than estimators from a
model  having  many  more  parameters.  So  even  if  a  test  of  proportional  odds  has  a  small 
P-value,  don't  discard  this  model  automatically. 

The output¹ in Table 8.6 also presents a score test of the proportional odds property.
This tests whether the effects are the same for each cumulative logit against the alternative
of separate effects. It compares the model with one parameter for $x_1$ and one for $x_2$ to
the more complex model with two parameters for each, allowing different effects for
$\text{logit}[P(Y \le 1)]$ and $\text{logit}[P(Y \le 2)]$. Here, the score statistic equals 0.87. It has df = 2,
since the more complex model has two additional parameters. The more complex model
does not fit significantly better ($P = 0.65$).

When  this  score  test  has  a  small  P-value,  it’s  helpful  to  check  whether  the  violation  of 
the  proportional  odds  property  is  substantively  important,  by  comparing  estimates  obtained 
from  separate  logistic  fits  to  the  binary  collapsings  of  the  response.  For  these  data,  consider 
the  effect  of  the  number  of  traumatic  events.  The  model  with  binary  response  categories 
(very happy, pretty happy or not too happy) has $\hat{\beta}_1 = -0.339$ (SE = 0.213), whereas the
model with binary categories (very happy or pretty happy, not too happy) has $\hat{\beta}_1 = -0.487$
(SE = 0.276). The effect has the same direction and a similar magnitude in each case, and
it is sensible to use the simpler proportional odds structure. There is less information in the
data about the race effect. We obtain $\hat{\beta}_2 = -1.846$ (SE = 0.763) for the second collapsing
but $\hat{\beta}_2 = -\infty$ for the first collapsing because there were no observations for blacks in the
very happy category and there is quasi-complete separation for that logit.

If  a  proportional  odds  model  fits  poorly  in  terms  of  practical  as  well  as  statistical 
significance,  alternative  strategies  exist.  These  include  (1)  adding  additional  terms,  such  as 
interactions,  to  the  linear  predictor;  (2)  trying  a  link  function  for  which  the  response  curve 
is  nonsymmetric  (e.g.,  complementary  log-log);  (3)  using  an  alternative  ordinal  model  for 
which  the  more  complex  non-proportional-odds  form  is  also  valid;  (4)  adding  dispersion 
parameters;  (5)  permitting  separate  effects  for  each  logit  for  some  but  not  all  predictors 
(i.e., partial proportional odds); and (6) fitting baseline-category logit models and using
the  ordinality  in  an  informal  way  in  interpreting  the  associations. 

For  approach  (1),  more  complex  cumulative  logit  models  are  formulated  as  in  ordinary 
logistic  regression.  For  the  example  on  modeling  happiness,  permitting  interaction  yields  a 
model  with  ML  fit 

$$\text{logit}[\hat{P}(Y \le j \mid \mathbf{x})] = \hat{\alpha}_j - 0.469 x_1 - 3.057 x_2 + 0.608(x_1 x_2),$$

where the coefficient of $x_1 x_2$ has SE = 0.601. The estimated effect of the number of
traumatic events on the cumulative logit is $-0.469$ for whites and $(-0.469 + 0.608) =
0.139$ for blacks. The impact of the number of traumatic events may be quite different (and
possibly  nonexistent)  for  blacks,  but  recall  that  the  black  sample  had  only  13  observations, 
and  here  the  difference  in  effects  is  not  significant. 

In  the  next  section  we  generalize  the  cumulative  logit  model  to  permit  extension  (2) 
of  alternative  link  functions.  In  Sections  8.3.4  and  8.3.6  we  introduce  models  that  satisfy 
option  (3).  Section  8.3.8  and  Note  8.8  discuss  extension  (4).  For  approach  (5),  see  Peterson 
and Harrell (1990), Stokes et al. (2012), and criticism by Cox (1995). Agresti (2010, Chap.
3-5)  discussed  further  these  alternative  strategies. 

1  Obtained  using  PROC  LOGISTIC  in  SAS. 

8.3 ORDINAL RESPONSES: ALTERNATIVE MODELS

Cumulative  logit  models  use  the  logit  link.  As  in  binary  GLMs,  other  link  functions  are 
possible.  In  this  section  we  introduce  models  having  alternative  link  functions  either  for 
cumulative  probabilities  or  other  response  probabilities. 

8.3.1  Cumulative  Link  Models 

Let $G^{-1}$ denote a link function that is the inverse of the continuous cdf $G$ (recall Section
4.2.5). The cumulative link model

$$G^{-1}[P(Y \le j \mid \mathbf{x})] = \alpha_j + \boldsymbol{\beta}^T \mathbf{x} \qquad (8.7)$$

links the cumulative probabilities to the linear predictor. The logit link function $G^{-1}(u) =
\log[u/(1 - u)]$ is the inverse of the standard logistic cdf.

As  in  the  cumulative  logit  model  with  proportional  odds  form  (8.5),  effects  of  x  in 
(8.7)  are  the  same  for  each  cumulative  probability.  In  Section  8.2.3  we  showed  that  this 
assumption holds whenever a latent variable $Y^*$ satisfies a linear regression model with
standard cdf $G$ for the error term. Model (8.7) results from discrete measurement of $Y^*$ from
a location-parameter family having cdf $G(y^* - \boldsymbol{\beta}^T \mathbf{x})$. The parameters $\{\alpha_j\}$ are category
cutpoints (or “thresholds”) on a standardized version of the latent scale. Thus, we can regard
cumulative link models as regression models that use a linear predictor $\boldsymbol{\beta}^T \mathbf{x}$ to describe
effects of explanatory variables on crude ordinal measurement of $Y^*$. Using $-\boldsymbol{\beta}$ rather than
$+\boldsymbol{\beta}$ in the linear predictor merely results in a change of sign of $\boldsymbol{\beta}$.

8.3.2  Cumulative  Probit  and  Log-Log  Models 

The  cumulative  probit  model  is  the  cumulative  link  model  using  the  standard  normal 
cdf $\Phi$ for $G$. This generalizes the binary probit model (Section 7.1) to ordinal responses.
It is appropriate when the conditional distribution for the latent variable $Y^*$ is normal.
Parameters in probit models refer to effects on $E(Y^*)$. For instance, consider the model
$\Phi^{-1}[P(Y \le j)] = \alpha_j - \beta x$. From Section 8.2.3, since $Y^* = \beta x + \epsilon$ where $\epsilon \sim N(0, 1)$
has cdf $\Phi$, a 1-unit increase in $x$ corresponds to a $\beta$ increase in $E(Y^*)$. When $\epsilon$ need not be
in standard form with $\sigma = 1$, a 1-unit increase in $x$ corresponds to a $\beta$ standard deviation
increase in $E(Y^*)$.

Cumulative  probit  models  provide  fits  similar  to  cumulative  logit  models.  They  have 
smaller  estimates  and  standard  errors  because  the  standard  normal  distribution  has  standard 
deviation  1.0  compared  with  1.81  for  the  standard  logistic. 

An  underlying  extreme  value  distribution  for  Y*  implies  the  model 

$$\log\{-\log[1 - P(Y \le j \mid \mathbf{x})]\} = \alpha_j + \boldsymbol{\beta}^T \mathbf{x}.$$

In Section 7.1 we introduced this complementary log-log link for binary data. The ordinal
model  using  this  link  is  sometimes  called  a  proportional  hazards  model  since  it  results 
from  a  generalization  of  the  proportional  hazards  model  for  survival  data  to  handle  grouped 
survival  times  (McCullagh  1980,  Note  8.6).  It  has  the  property 


$$P(Y > j \mid \mathbf{x}_1) = [P(Y > j \mid \mathbf{x}_2)]^{\exp[\boldsymbol{\beta}^T(\mathbf{x}_1 - \mathbf{x}_2)]}.$$

With this link, $P(Y \le j)$ approaches 1.0 at a faster rate than it approaches 0.0. The related
log-log link $\log\{-\log[P(Y \le j)]\}$ is appropriate when the complementary log-log link
holds for the categories listed in reverse order.
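
The power relationship between the two survival-type probabilities follows because the complementary log-log link gives $P(Y > j \mid \mathbf{x}) = \exp[-\exp(\alpha_j + \boldsymbol{\beta}^T \mathbf{x})]$. The short numerical check below is a sketch only; the parameter values and the function name `surv_cloglog` are hypothetical and chosen purely for illustration.

```python
import math


def surv_cloglog(alpha_j, beta, x):
    """P(Y > j | x) when log{-log[1 - P(Y <= j | x)]} = alpha_j + beta'x."""
    eta = alpha_j + sum(b * v for b, v in zip(beta, x))
    return math.exp(-math.exp(eta))


# Hypothetical cutpoint, effects, and predictor settings (not from the text).
alpha_j = -0.3
beta = (0.5, -1.2)
x1 = (1.0, 0.0)
x2 = (3.0, 1.0)

lhs = surv_cloglog(alpha_j, beta, x1)
power = math.exp(sum(b * (u - v) for b, u, v in zip(beta, x1, x2)))
rhs = surv_cloglog(alpha_j, beta, x2) ** power

print(lhs, rhs)  # the two values agree up to floating-point error
```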

McCullagh (1980) and Thompson and Baker (1981) treated cumulative link models as
multivariate  GLMs.  McCullagh  presented  a  Fisher  scoring  algorithm  for  ML  estimation.  He 
showed  that  sufficiently  large  n  guarantees  a  unique  maximum  of  the  likelihood.  Burridge 
(1981)  and  Pratt  (1981)  showed  that  the  log  likelihood  is  concave  for  many  cumulative 
link  models,  including  the  logit,  probit,  and  complementary  log-log.  Iterative  algorithms 
usually  converge  rapidly  to  the  ML  estimates. 

8.3.3  Example:  Happiness  Revisited  with  Cumulative  Probits 

In Section 8.2.4 we modeled $Y$ = happiness in terms of $x_1$ = total number of traumatic
events that happened to the respondent and his/her relatives in the last year, and $x_2$ = race.
The cumulative logit model gave the fit

$$\text{logit}[\hat{P}(Y \le j)] = \hat{\alpha}_j - 0.406 x_1 - 2.036 x_2.$$

The corresponding cumulative probit model has fit

$$\Phi^{-1}[\hat{P}(Y \le j)] = \hat{\alpha}_j - 0.221 x_1 - 1.157 x_2,$$

with SE = 0.098 for $\hat{\beta}_1 = -0.221$ and SE = 0.382 for $\hat{\beta}_2 = -1.157$. The nature of the
effects and the substantive significance is the same for the two models.

We  can  interpret  parameter  estimates  in  terms  of  the  underlying  latent  variable  model. 
For  example,  conditional  on  the  number  of  traumatic  events,  the  latent  distribution  on 
happiness  is  estimated  to  have  location  for  whites  that  is  1.157  standard  deviations  in  the 
more  happy  direction  compared  with  that  for  blacks. 
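
As a rough check on the relative scale of the two fits, the sketch below compares the ratio of logit to probit estimates with the standard deviation $\pi/\sqrt{3}$ of the standard logistic distribution. The estimates are those quoted in this section; the dictionary names are illustrative only, and the agreement is expected to be approximate rather than exact.

```python
import math

# Standard deviation of the standard logistic distribution.
logistic_sd = math.pi / math.sqrt(3)

# Estimates quoted in the text for the happiness data.
logit_est = {"traumatic": -0.406, "race": -2.036}
probit_est = {"traumatic": -0.221, "race": -1.157}

ratios = {name: logit_est[name] / probit_est[name] for name in logit_est}

print(round(logistic_sd, 2))
for name, ratio in ratios.items():
    print(name, round(ratio, 2))
```

Both ratios fall near the logistic standard deviation, which is why the two models lead to the same substantive conclusions despite the different sizes of their estimates.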

8.3.4  Adjacent-Categories  Logit  Models 

Models  for  ordinal  responses  need  not  use  cumulative  probabilities.  For  the  logit  link, 
for  example,  ordinal  logits  can  use  pairs  of  adjacent  response  probabilities.  The  adjacent- 
categories  logits  are 

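$$\text{logit}[P(Y = j \mid Y = j \text{ or } j + 1)] = \log \frac{\pi_j}{\pi_{j+1}}, \quad j = 1, \ldots, J - 1. \qquad (8.8)$$

These logits are a basic set equivalent to the baseline-category logits. The connections are

$$\log \frac{\pi_j}{\pi_J} = \log \frac{\pi_j}{\pi_{j+1}} + \log \frac{\pi_{j+1}}{\pi_{j+2}} + \cdots + \log \frac{\pi_{J-1}}{\pi_J} \qquad (8.9)$$

and

$$\log \frac{\pi_j}{\pi_{j+1}} = \log \frac{\pi_j}{\pi_J} - \log \frac{\pi_{j+1}}{\pi_J}, \quad j = 1, \ldots, J - 1.$$

Either set determines logits for all $\binom{J}{2}$ pairs of response categories.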

Models using adjacent-categories logits can be expressed as baseline-category logit
models. For instance, consider the adjacent-categories logit model of proportional odds
form,

$$\log \frac{\pi_j(\mathbf{x})}{\pi_{j+1}(\mathbf{x})} = \alpha_j + \boldsymbol{\beta}^T \mathbf{x}, \quad j = 1, \ldots, J - 1, \qquad (8.10)$$

with common effect $\boldsymbol{\beta}$. From adding $(J - j)$ terms as in (8.9), the equivalent
baseline-category logit model is

$$\log \frac{\pi_j(\mathbf{x})}{\pi_J(\mathbf{x})} = \sum_{k=j}^{J-1} \alpha_k + \boldsymbol{\beta}^T (J - j)\mathbf{x} = \alpha_j^* + \boldsymbol{\beta}^T \mathbf{u}_j, \quad j = 1, \ldots, J - 1,$$

with $\mathbf{u}_j = (J - j)\mathbf{x}$. The adjacent-categories logit model corresponds to a
baseline-category logit model with adjusted model matrix but also a single parameter for each
predictor.
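
The equivalence can be used to compute response probabilities from adjacent-categories logit parameters. The sketch below builds the baseline-category logits $\sum_{k=j}^{J-1} \alpha_k + (J - j)\boldsymbol{\beta}^T \mathbf{x}$ and then recovers the adjacent-categories logits from the resulting probabilities. The intercept values and the function name `acl_probs` are hypothetical; only the slope values echo estimates that appear later in this chapter.

```python
import math


def acl_probs(alphas, beta, x):
    """Response probabilities for an adjacent-categories logit model of
    proportional odds form, computed via its baseline-category logit form."""
    J = len(alphas) + 1
    eta = sum(b * v for b, v in zip(beta, x))
    # log(pi_j / pi_J) = sum_{k=j}^{J-1} alpha_k + (J - j) * beta'x, for j = 1..J-1
    base = [sum(alphas[j - 1:]) + (J - j) * eta for j in range(1, J)]
    base.append(0.0)  # category J compared with itself
    denom = sum(math.exp(b) for b in base)
    return [math.exp(b) / denom for b in base]


alphas = [0.4, 1.1]        # hypothetical intercepts, for illustration only
beta = [-0.357, -1.842]    # slopes of the size reported for the happiness data
x = [2.0, 1.0]

probs = acl_probs(alphas, beta, x)
eta = sum(b * v for b, v in zip(beta, x))
adjacent_logits = [math.log(probs[j] / probs[j + 1]) for j in range(len(alphas))]

print([round(p, 4) for p in probs])
print(adjacent_logits, [a + eta for a in alphas])  # the two lists should match
```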

The  construction  of  the  adjacent-categories  logits  recognizes  the  ordering  of  Y  categories. 
To  benefit  from  this  in  model  parsimony  requires  appropriate  specification  of  the  linear 
predictor.  When  an  explanatory  variable  has  similar  effect  for  each  logit,  advantages  accrue 
from  using  the  proportional  odds  form  (8.10)  with  a  single  parameter  instead  of  (J  —  1) 
parameters  describing  that  effect.  This  model  fits  well  in  similar  situations  as  the  cumulative 
logit  model  of  proportional  odds  form.  Your  choice  of  model  type  may  reflect  whether  you 
prefer  effects  to  refer  to  individual  response  categories,  as  the  adjacent-categories  logits 
provide,  or  instead  to  groupings  of  categories  using  the  entire  scale  or  an  underlying  latent 
variable,  which  cumulative  logits  provide.  Since  effects  in  cumulative  logit  models  refer 
to  the  entire  scale,  they  are  usually  larger  in  magnitude.  The  ratio  of  estimate  to  standard 
error,  however,  is  usually  similar  for  the  two  model  types. 

An  advantage  of  the  cumulative  logit  model  is  the  approximate  invariance  of  effect 
estimates  to  the  choice  and  number  of  response  categories.  An  advantage  of  the  adjacent- 
categories logit model is that the more general model with $\boldsymbol{\beta}$ replaced by $\boldsymbol{\beta}_j$ is a valid
model  (i.e.,  cumulative  probabilities  will  not  be  out  of  order),  namely,  one  that  is  exactly 
equivalent  to  an  ordinary  baseline-category  logit  model.  Also,  because  of  its  equivalence 
with  canonical-link  (baseline-category  logit)  models,  the  model  has  reduced  sufficient 
statistics  and  we  can  use  conditional  ML  estimation  for  inference  with  small  samples  or 
many  parameters.  Finally,  its  effects  can  be  estimated  with  case-control  studies  (Mukherjee 
and  Liu  2008). 


8.3.5  Example:  Happiness  Revisited 

We return to the example in Sections 8.2.4 and 8.3.3 on modeling happiness in terms of $x_1$
= total number of traumatic events and $x_2$ = race. The adjacent-categories logit model of
proportional odds form has ML fit

$$\log[\hat{P}(Y = j)/\hat{P}(Y = j + 1)] = \hat{\alpha}_j - 0.357 x_1 - 1.842 x_2.$$

Conditional on the number of traumatic events, the estimated odds of being very happy
instead of pretty happy, and the estimated odds of being pretty happy instead of not too
happy, are $e^{1.842} = 6.31$ times as high for whites as for blacks. By contrast, the cumulative
logit model had $\hat{\beta}_1 = -0.406$ and $\hat{\beta}_2 = -2.036$. As expected, its estimates are somewhat
larger in magnitude. They are not much different for these data, however, because 65 of the
97 observations fall in the middle of the three response categories (pretty happy).

For  these  data,  the  more  general  model  having  different  effects  for  each  adjacent- 
categories logit has estimate $-\infty$ for the effect of race for the first logit, because there is
quasi-complete  separation  for  that  logit.  The  estimates  for  the  effect  of  number  of  traumatic 
events  are  —0.299  for  the  first  logit  and  —0.432  for  the  second  logit,  suggesting  that  it  is 
adequate  to  use  the  more  parsimonious  model  of  proportional  odds  form  with  its  common 
estimate  of  —0.357. 


8.3.6  Continuation-Ratio  Logit  Models 

The  continuation-ratio  logits  are  defined  as 

$$\log \frac{\pi_j}{\pi_{j+1} + \cdots + \pi_J}, \quad j = 1, \ldots, J - 1, \qquad (8.11)$$

or as

$$\log \frac{\pi_{j+1}}{\pi_1 + \cdots + \pi_j}, \quad j = 1, \ldots, J - 1. \qquad (8.12)$$


The  continuation-ratio  logit  model  form  is  useful  when  a  sequential  mechanism,  such  as 
survival  through  various  age  periods,  determines  the  response  outcome  (e.g.,  Tutz  1991). 
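Let $\omega_j = P(Y = j \mid Y \ge j)$. With explanatory variables,

$$\omega_j(\mathbf{x}) = \frac{\pi_j(\mathbf{x})}{\pi_j(\mathbf{x}) + \cdots + \pi_J(\mathbf{x})}, \quad j = 1, \ldots, J - 1. \qquad (8.13)$$

The continuation-ratio logits (8.11) are ordinary logits of these conditional probabilities:
namely, $\log[\omega_j(\mathbf{x})/(1 - \omega_j(\mathbf{x}))]$.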

At the $i$th setting $\mathbf{x}_i$ of $\mathbf{x}$, let $\{y_{ij}, j = 1, \ldots, J\}$ denote the response counts, with
$n_i = \sum_j y_{ij}$. When $n_i = 1$, $y_{ij}$ indicates whether the response is in category $j$, as in Section
8.1.4. Let $b(n, y; \omega)$ denote the binomial probability of $y$ successes in $n$ trials with parameter
$\omega$ for each trial. From the representation of the multinomial probability of $(y_{i1}, \ldots, y_{iJ})$
in the form $p(y_{i1})\,p(y_{i2} \mid y_{i1}) \cdots p(y_{iJ} \mid y_{i1}, \ldots, y_{i,J-1})$, it follows that the multinomial mass
function has factorization

$$b[n_i, y_{i1}; \omega_1(\mathbf{x}_i)]\; b[n_i - y_{i1}, y_{i2}; \omega_2(\mathbf{x}_i)] \cdots b[n_i - y_{i1} - \cdots - y_{i,J-2}, y_{i,J-1}; \omega_{J-1}(\mathbf{x}_i)]. \qquad (8.14)$$

The full likelihood is the product of multinomial mass functions from the different $\mathbf{x}_i$
values. Thus, the log likelihood is a sum of terms such that different $\omega_j$ enter into different
terms. When parameters in the model specification for $\text{logit}(\omega_j)$ are distinct from those
for $\text{logit}(\omega_k)$ whenever $j \ne k$, maximizing each term separately maximizes the full log
likelihood. Thus, separate fitting of models for different continuation-ratio logits gives the
same results as simultaneous fitting. The sum of the $J - 1$ separate $G^2$ statistics provides
an overall goodness-of-fit statistic pertaining to the simultaneous fitting of $J - 1$ models.
Because of factorization (8.14), separate fitting can use methods for binary logistic models.
Similar remarks apply to continuation-ratio logits (8.12), although those logits and the
subsequent analysis do not give equivalent results.

Sometimes,  a  simpler  proportional  odds  form  of  the  model  is  plausible  in  which  effects 
are  the  same  for  each  logit  (McCullagh  and  Nelder  1989,  p.  164;  Tutz  1991).  Because  of 
the factorization (8.14), it is also possible to fit such a model simply by creating a data file
of  independent  binomials.  See  Agresti  (2010,  Sec.  4.2). 
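
Factorization (8.14) can be checked numerically for a single multinomial observation. The sketch below is illustrative only: the counts and probabilities are hypothetical, and the helper names (`binom_pmf`, `multinomial_pmf`, `product_of_binomials`) are not from the text.

```python
import math


def binom_pmf(n, y, omega):
    """Binomial probability b(n, y; omega)."""
    return math.comb(n, y) * omega**y * (1.0 - omega) ** (n - y)


def multinomial_pmf(counts, probs):
    """Multinomial probability of the counts given category probabilities."""
    coef = math.factorial(sum(counts))
    for y in counts:
        coef //= math.factorial(y)
    return coef * math.prod(p**y for p, y in zip(probs, counts))


def product_of_binomials(counts, probs):
    """Right-hand side of (8.14): sequential binomials with omega_j = P(Y=j | Y>=j)."""
    remaining_n = sum(counts)
    remaining_p = 1.0
    out = 1.0
    for y, p in zip(counts[:-1], probs[:-1]):
        omega = p / remaining_p          # pi_j / (pi_j + ... + pi_J)
        out *= binom_pmf(remaining_n, y, omega)
        remaining_n -= y                 # trials left for later categories
        remaining_p -= p
    return out


counts = (3, 2, 5)         # hypothetical response counts for J = 3 categories
probs = (0.2, 0.3, 0.5)    # hypothetical category probabilities

print(multinomial_pmf(counts, probs), product_of_binomials(counts, probs))
```

Because each binomial factor involves only one $\omega_j$, the same structure explains why a data file of $J - 1$ independent binomials per setting suffices for fitting.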


8.3.7  Example:  Developmental  Toxicity  Study  with  Pregnant  Mice 

Table  8.7  comes  from  a  developmental  toxicity  study.  Such  experiments  with  rodents 
test  substances  posing  potential  danger  to  developing  fetuses.  Diethylene  glycol  dimethyl 
ether  (diEGdiME),  one  such  substance,  is  an  industrial  solvent  used  in  the  manufacture  of 
protective  coatings  such  as  lacquer  and  metal  coatings.  This  study  administered  diEGdiME 
in  distilled  water  to  pregnant  mice.  Each  mouse  was  exposed  to  one  of  five  concentration 
levels  for  10  days  early  in  the  pregnancy.  The  mice  exposed  to  level  0  formed  a  control 
group.  Two  days  later,  the  uterine  contents  of  the  pregnant  mice  were  examined  for  defects. 
Each  fetus  has  three  possible  outcomes  (nonlive,  malformation,  normal).  The  outcomes  are 
ordered,  with  nonlive  the  least  desirable  result.  We  use  continuation-ratio  logits  to  model 
(1) the probability $\pi_1$ of a nonlive fetus, and (2) the conditional probability $\pi_2/(\pi_2 + \pi_3)$
of  a  malformed  fetus,  given  that  the  fetus  was  live. 

We  fitted  the  continuation-ratio  logit  models 


$$\log \frac{\pi_1(x_i)}{\pi_2(x_i) + \pi_3(x_i)} = \alpha_1 + \beta_1 x_i, \qquad \log \frac{\pi_2(x_i)}{\pi_3(x_i)} = \alpha_2 + \beta_2 x_i,$$

using $x_i$ scores {0, 62.5, 125, 250, 500} for concentration level. The ML estimates are $\hat{\beta}_1 =
0.0064$ (SE = 0.0004) and $\hat{\beta}_2 = 0.0174$ (SE = 0.0012). In each case, the less desirable
outcome  is  more  likely  as  the  concentration  increases.  For  instance,  given  that  a  fetus  was 


Table 8.7 Outcomes for Pregnant Mice in Developmental Toxicity Study

Concentration                    Response
(mg/kg per day)     Nonlive    Malformation    Normal
0 (controls)           15            1           281
62.5                   17            0           225
125                    22            7           283
250                    38           59           202
500                   144          132             9

Source: Based on results in C. J. Price et al., Fundam. Appl. Toxicol. 8: 115-126, 1987.
I thank Louise Ryan for showing me these data.

live, the estimated odds that it was malformed rather than normal multiplies by exp(1.74) =
5.7 for every 100-unit increase in the concentration of diEGdiME. The likelihood-ratio fit
statistics are $G^2 = 5.78$ for $j = 1$ and $G^2 = 6.06$ for $j = 2$, each based on df = 3. Their
sum, $G^2 = 11.84$ (or similarly $X^2 = 9.76$), with df = 6, summarizes the fit.
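
Because of the factorization in Section 8.3.6, the two continuation-ratio logit models can be fitted as separate binary logistic regressions on the grouped counts of Table 8.7. The following is a minimal sketch of that idea using a hand-written Newton-Raphson routine with step halving; it is not the software used for the results quoted above, and the function names are illustrative. Under these assumptions the slope estimates should come out close to the values 0.0064 and 0.0174 reported in the text.

```python
import math

# Table 8.7: (concentration, nonlive, malformation, normal)
DATA = [
    (0.0, 15, 1, 281),
    (62.5, 17, 0, 225),
    (125.0, 22, 7, 283),
    (250.0, 38, 59, 202),
    (500.0, 144, 132, 9),
]


def loglik(a, b, rows):
    """Binomial log likelihood (up to a constant) for logit(p) = a + b*x."""
    total = 0.0
    for x, y, n in rows:
        eta = a + b * x
        # y*eta - n*log(1 + exp(eta)), arranged to avoid overflow
        total += y * eta - n * (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def fit_logistic(rows, tol=1e-10, max_iter=100):
    """Newton-Raphson with step halving; rows are (x, successes, trials)."""
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in rows:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            resid = y - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det   # information-matrix inverse times score
        db = (h00 * g1 - h01 * g0) / det
        step, current = 1.0, loglik(a, b, rows)
        while loglik(a + step * da, b + step * db, rows) < current and step > 1e-8:
            step /= 2.0
        a, b = a + step * da, b + step * db
        if abs(step * da) < tol and abs(step * db) < tol:
            break
    return a, b


# Logit 1: nonlive versus live (malformation or normal).
rows1 = [(x, dead, dead + malf + norm) for x, dead, malf, norm in DATA]
# Logit 2: malformation versus normal, among live fetuses.
rows2 = [(x, malf, malf + norm) for x, dead, malf, norm in DATA]

a1, b1 = fit_logistic(rows1)
a2, b2 = fit_logistic(rows2)
print(round(b1, 4), round(b2, 4))
```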

This  analysis  treats  pregnancy  outcomes  for  different  fetuses  as  independent,  identical 
observations.  In  fact,  each  pregnant  mouse  had  a  litter  of  fetuses,  and  statistical  dependence 
may  exist  among  different  fetuses  in  the  same  litter.  Different  litters  at  a  given  concentration 
level  may  also  have  different  response  probabilities.  Heterogeneity  of  various  sorts  among 
the litters (e.g., due to varying physical and/or genetic characteristics among different pregnant
mice) would cause these probabilities to vary somewhat. Either statistical dependence
or heterogeneous probabilities violates the binomial assumption and causes overdispersion.
At a fixed concentration level, the number of fetuses in a litter that die may vary among
pregnant mice more than if the counts were independent and identical binomial variates.
The total $G^2$ shows some evidence of lack of fit ($P = 0.07$) but may reflect overdispersion
caused by these factors rather than an inappropriate choice of response curve.

To account for overdispersion, we could adjust standard errors using the quasi-likelihood
approach (Section 4.7). This multiplies standard errors by $\sqrt{X^2/\text{df}} = \sqrt{9.76/6} = 1.28$. For
each logit, strong evidence remains that $\beta_j > 0$. In Chapters 13 and 14 we present other
methods that account for the clustering of fetuses in litters.
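
The summary numbers in this example can be verified with a few lines of arithmetic. The sketch below uses the closed-form chi-squared tail probability that holds for even degrees of freedom (an implementation choice made here to avoid external libraries); the function name `chi2_sf_even_df` is illustrative.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variate; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


# Overall lack-of-fit test from the two separate G^2 statistics (df = 3 each).
g2_total = 5.78 + 6.06
p_value = chi2_sf_even_df(g2_total, 6)

# Quasi-likelihood inflation of the standard errors, sqrt(X^2 / df).
scale = math.sqrt(9.76 / 6)
estimates = (0.0064, 0.0174)
adjusted_se = [se * scale for se in (0.0004, 0.0012)]
z_values = [est / se for est, se in zip(estimates, adjusted_se)]

# Multiplicative effect on the odds of malformation per 100-unit increase.
odds_multiplier = math.exp(100 * 0.0174)

print(round(p_value, 2), round(scale, 2), round(odds_multiplier, 1))
print([round(z, 1) for z in z_values])
```

Even after inflating the standard errors, both ratios of estimate to adjusted standard error remain far above conventional critical values, consistent with the conclusion that strong evidence of a positive effect remains.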

8.3.8  Stochastic  Ordering  Location  Effects  Versus  Dispersion  Effects 

For cumulative link models, settings of the explanatory variables are stochastically ordered
on the response: For any pair $\mathbf{x}_1$ and $\mathbf{x}_2$, either $P(Y \le j \mid \mathbf{x}_1) \le P(Y \le j \mid \mathbf{x}_2)$ for
all $j$ or $P(Y \le j \mid \mathbf{x}_1) \ge P(Y \le j \mid \mathbf{x}_2)$ for all $j$. Figure 8.7a illustrates for underlying
continuous density functions and cdf's at two settings of $\mathbf{x}$. Likewise, the adjacent-categories
and continuation-ratio logit models with proportional odds structure imply stochastically
ordered distributions for $Y$ at different predictor values.


Figure 8.7 (a) Distribution 1 stochastically higher than distribution 2; (b) distributions not stochastically ordered. (Panels show the underlying density functions and the underlying distribution functions.)

When this is violated and such models fit poorly, often it is because the dispersion also
varies with $\mathbf{x}$. For instance, perhaps responses tend to concentrate around the same location
but more dispersion occurs at $\mathbf{x}_1$ than at $\mathbf{x}_2$. Then perhaps $P(Y \le j \mid \mathbf{x}_1) > P(Y \le j \mid \mathbf{x}_2)$
for small $j$ but $P(Y \le j \mid \mathbf{x}_1) < P(Y \le j \mid \mathbf{x}_2)$ for large $j$. In other words, at $\mathbf{x}_1$ the responses
concentrate more at the extreme categories than at $\mathbf{x}_2$. Figure 8.7b illustrates for underlying
continuous distributions.
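
Whether two estimated response distributions are stochastically ordered can be checked by comparing their cumulative probabilities category by category. The sketch below is a simple illustration with hypothetical probability vectors; the function name `stochastic_order` and its return labels are choices made here, not notation from the text.

```python
from itertools import accumulate


def stochastic_order(p1, p2, tol=1e-12):
    """Compare two distributions on the same ordered categories.

    Returns "first" if distribution 1 is stochastically higher or equal
    (its cdf is never above that of distribution 2), "second" for the
    reverse, and "unordered" if the cdfs cross.
    """
    cdf1 = list(accumulate(p1))[:-1]   # drop the final value, which is always 1
    cdf2 = list(accumulate(p2))[:-1]
    diffs = [a - b for a, b in zip(cdf1, cdf2)]
    if all(d <= tol for d in diffs):
        return "first"
    if all(d >= -tol for d in diffs):
        return "second"
    return "unordered"


# Location shift: distribution 1 puts more mass on high categories.
print(stochastic_order((0.1, 0.3, 0.6), (0.3, 0.4, 0.3)))

# Dispersion difference: same center, but distribution 1 is more spread out.
print(stochastic_order((0.4, 0.2, 0.4), (0.2, 0.6, 0.2)))
```

The second comparison mimics the dispersion pattern described above: the first distribution has the larger cumulative probability at the low end and the smaller one at the high end, so the cdfs cross.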

Cumulative  link  models  have  been  proposed  that  incorporate  dispersion  effects,  but 
mainly  for  relatively  simple  cases  such  as  with  a  single  predictor  that  is  a  factor  (Note  8.8). 
A  simpler  approach  when  a  cumulative  link  model  fits  poorly  is  to  fit  the  model  separately 
for  each  cumulative  probability  to  investigate  the  nature  of  the  lack  of  fit  or  to  use  one  of 
the  other  options  mentioned  at  the  end  of  Section  8.2.5. 


8.3.9  Summarizing  Predictive  Power  of  Explanatory  Variables 

How  can  we  summarize  how  well  the  response  can  be  predicted  using  the  fit  of  the  chosen 
model? One approach estimates a measure such as the multiple correlation or $R$-squared
for  the  regression  model  for  an  underlying  latent  response  variable.  McKelvey  and  Zavoina 
(1975)  suggested  this  for  the  cumulative  probit  model. 

Another  index  of  predictive  power  generalizes  the  concordance  index  (Section  6.3.4). 
For all pairs of observations that have different response outcomes, it estimates the probability
that the predictions and the outcomes are concordant, that is, that the observation with
the larger $y$-value also has a stochastically higher set of estimated probabilities (and hence,
for example, a higher mean for the estimated conditional distribution). The baseline value
of no effect is 0.50. A value of 1.0 results when knowing which observation in an untied pair
has  the  stochastically  higher  estimated  distribution  enables  us  to  perfectly  predict  which 
one  has  the  higher  actual  response.  The  higher  the  value  of  the  concordance  index,  the 
better  the  predictive  power. 
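
A concordance index of this kind can be computed by brute force over pairs. The sketch below makes two assumptions that go beyond the text: each observation's estimated distribution is summarized by a single score (for example, its estimated mean), and pairs with tied scores count as one half. The data are hypothetical.

```python
def concordance_index(y, score):
    """Proportion of pairs with different outcomes whose scores are ordered
    the same way as the outcomes; tied scores count one half (an assumption)."""
    concordant = discordant = tied = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue                      # pairs tied on the response are skipped
            hi, lo = (i, j) if y[i] > y[j] else (j, i)
            if score[hi] > score[lo]:
                concordant += 1
            elif score[hi] < score[lo]:
                discordant += 1
            else:
                tied += 1
    total = concordant + discordant + tied
    return (concordant + 0.5 * tied) / total


# Hypothetical ordinal outcomes and estimated means of the fitted distributions.
y = [1, 2, 2, 3, 3]
score = [1.2, 1.9, 2.4, 2.1, 2.8]
print(concordance_index(y, score))
```

In this small example, seven of the eight untied pairs are concordant. A constant score for every observation gives the baseline value of 0.50.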

Such  measures  are  mainly  useful  for  comparing  different  models.  For  example,  for 
the  happiness  data  analyzed  with  a  proportional  odds  type  of  cumulative  logit  model  in 
Section  8.2.4,  the  concordance  index  is  0.688  for  the  main-effects  model  and  0.689  when 
an  interaction  term  is  added.  So,  the  more  complex  model  is  not  much  more  useful  for 
predictions,  regardless  of  whether  its  extra  term  is  statistically  significant. 

Keep  in  mind  that  predictive  power  is  distinct  from  goodness  of  fit.  A  model  may  fit  a 
particular  data  set  well  even  if  the  predictive  power  the  model  provides  is  small.  For  other 
approaches  to  summarizing  predictive  power,  see  Agresti  (2010,  Sec.  3.4.6). 


8.4 TESTING CONDITIONAL INDEPENDENCE IN I × J × K TABLES

A  common  statistical  analysis  in  many  applications  is  studying  whether  an  explanatory 
variable  X  has  an  effect  on  a  response  variable  Y  after  we  adjust  for  one  or  more  other 
relevant  factors.  In  Section  6.4  we  considered  this  for  binary  Y  and  X  using  logistic 
models  and  the  Cochran-Mantel-Haenszel  (CMH)  test  of  conditional  independence  for 
2 × 2 × K tables. This section presents related tests with multicategory variables, in the
context of I × J × K tables. Likelihood-ratio tests compare the fit of a model specifying
XY  conditional  independence  with  a  model  permitting  X  to  have  an  effect.  Generalizations 
of  the  CMH  statistic  are  score  statistics  for  certain  models. 

8.4.1 Testing Conditional Independence Using Multinomial Models

Denote a control factor by $Z$. Treating $Z$ as nominal scale, we discuss four cases that treat
$(Y, X)$ as (nominal, nominal), (nominal, ordinal), (ordinal, nominal), (ordinal, ordinal).
When $Y$ is nominal, the baseline-category logit model of $XY$ conditional independence
is

$$\log \frac{P(Y = j \mid X = i, Z = k)}{P(Y = J \mid X = i, Z = k)} = \alpha_{jk}. \qquad (8.15)$$

That is, each logit does not depend on the category of $X$. For ordinal $Y$ we use cumulative
logit models, but other ordinal models yield analogous tests. Then, $XY$ conditional
independence is equivalent to the model

$$\text{logit}[P(Y \le j \mid X = i, Z = k)] = \alpha_{jk},$$

with $\alpha_{1k} < \alpha_{2k} < \cdots < \alpha_{J-1,k}$ for each $k$. When the $XY$ association is similar in the partial
tables, the power of a test benefits from basing a test statistic on a model of homogeneous
association.

1. Y nominal, X nominal. An alternative to $XY$ conditional independence that treats $X$
as a factor is

$$\log \frac{P(Y = j \mid X = i, Z = k)}{P(Y = J \mid X = i, Z = k)} = \alpha_{jk} + \beta_{ij}, \qquad (8.16)$$

with constraint such as $\beta_{Ij} = 0$ for each $j$. For each outcome category $j$, $X$ and $Z$ have
additive effects of form $\alpha_k + \beta_i$. Conditional independence is $H_0$: $\beta_{1j} = \cdots = \beta_{Ij}$
for $j = 1, \ldots, J - 1$. Large-sample chi-squared tests have df = $(I - 1)(J - 1)$.

2. Y nominal, X ordinal. Let $\{x_i\}$ be ordered scores. A test that is sensitive to the same
linear trend alternatives in each partial table compares the conditional independence
model to

$$\log \frac{P(Y = j \mid X = i, Z = k)}{P(Y = J \mid X = i, Z = k)} = \alpha_{jk} + \beta_j x_i.$$

Conditional independence is $H_0$: $\beta_1 = \cdots = \beta_{J-1} = 0$. Large-sample chi-squared
tests have df = $J - 1$.

3. Y ordinal, X nominal. An alternative to XY conditional independence that treats X as
a factor is

$$\text{logit}[P(Y \le j \mid X = i, Z = k)] = \alpha_{jk} + \beta_i,$$

with a constraint such as $\beta_I = 0$. A simpler model that also has proportional odds
structure for the effects of Z has linear predictor $\alpha_j + \beta_k^Z + \beta_i$. For either model, XY
conditional independence is $H_0$: $\beta_1 = \cdots = \beta_I$. Large-sample chi-squared tests have
df = $I - 1$.

Table 8.8 Summary of Models for Testing Conditional Independence^a

Y-X                Model                                                     Conditional Independence                 df
Ordinal-ordinal    $\text{logit}[P(Y \le j)] = \alpha_{jk} + \beta x_i$               $\beta = 0$                              $1$
Ordinal-nominal    $\text{logit}[P(Y \le j)] = \alpha_{jk} + \beta_i$                 $\beta_1 = \cdots = \beta_I$             $I - 1$
Nominal-ordinal    $\log[P(Y = j)/P(Y = J)] = \alpha_{jk} + \beta_j x_i$     $\beta_1 = \cdots = \beta_{J-1} = 0$     $J - 1$
Nominal-nominal    $\log[P(Y = j)/P(Y = J)] = \alpha_{jk} + \beta_{ij}$      All $\beta_{ij} = 0$                     $(I - 1)(J - 1)$

^a The first two cases can also use $\alpha_j + \beta_k^Z$ in place of $\alpha_{jk}$.

4. Y ordinal, X ordinal. For ordered scores $\{x_i\}$, the model

$$\text{logit}[P(Y \le j \mid X = i, Z = k)] = \alpha_{jk} + \beta x_i \tag{8.17}$$

has the same linear trend for the X effect in each partial table. A simpler model
that also has proportional odds structure for the effects of Z has linear predictor
$\alpha_j + \beta_k^Z + \beta x_i$. For either model, XY conditional independence is $H_0$: $\beta = 0$.
Large-sample chi-squared tests have df = 1.

Table  8.8  summarizes  the  four  tests.  They  work  well  when  the  model  describes  a  major 
component  of  the  departure  from  conditional  independence.  This  does  not  mean  that  we 
must  test  the  fit  of  the  model  in  order  to  use  the  test  (see  the  remarks  at  the  end  of  Section 
6.4.2). 

Occasionally,  the  association  may  change  dramatically  across  the  K  partial  tables.  When 
Z  is  ordinal,  an  alternative  by  which  a  log  odds  ratio  changes  linearly  across  levels  of  Z 
is  sometimes  of  use.  For  instance,  when  Z  =  age  of  subject,  the  association  between  a 
risk  factor  X  (e.g.,  level  of  smoking)  and  a  response  Y  (e.g.,  severity  of  heart  disease)  may 
tend  to  increase  with  Z.  When  Z  is  nominal,  the  conditional  independence  models  can  be 
compared  with  a  more  general  alternative  having  separate  effect  parameters  at  each  level  of 
Z. Allowing effects to vary across levels of Z, however, results in the test df being multiplied
by K, which handicaps power.

8.4.2  Example:  Homosexual  Marriage  and  Religious  Fundamentalism 

In 2008 the General Social Survey asked whether homosexuals should have the right to
marry. One variable with which we’d expect responses to be associated is the
fundamentalism/liberalism of a subject’s religious beliefs. A subject’s attained education is likely
associated with both these variables, so is there an association when we condition on
education? Table 8.9 shows the relationship between opinion about homosexual marriage (Y)
and religious beliefs (X), stratified by Z = attained education, for subjects of age 18-25.

Table 8.10 summarizes the fit of several logistic models and shows the results of related
likelihood-ratio tests of conditional independence. Each test compares a model to the model
deleting the religious beliefs effect, conditioning on attained education. The models that
treat opinion as ordinal use cumulative logits, with linear predictor $\alpha_j + \beta_k^Z + \beta x_i$ to treat

Table 8.9 Opinion About Homosexual Marriage by Religious Beliefs,
at Two Education Levels

                                           Homosexuals Should Be Able to Marry
Education                Religion          Agree    Neutral    Disagree
High school or less      Fundamentalist        6          2          10
                         Moderate              8          3           9
                         Liberal              11          5           6
At least some college    Fundamentalist        4          2          11
                         Moderate             21          3           5
                         Liberal              22          4           1

Source: 2008 General Social Survey, subsample for ages 18-25.

Table 8.10 Summary of Model-Based Likelihood-Ratio Tests of
Conditional Independence for Table 8.9

                                               Test
Opinion    Religion        G^2 Fit    df    Statistic    df    P-value
Ordinal    Ordinal           10.36     8        16.57     1    <0.0001
           Nominal            9.17     7        17.76     2     0.0001
           Not in model      26.93     9            -     -          -
Nominal    Ordinal            7.33     6        19.53     2     0.0001
           Nominal            6.58     4        20.27     4     0.0004
           Not in model      26.85     8            -     -          -

X as an ordinal predictor using $\{x_i\}$ scores (1, 2, 3) and linear predictor $\alpha_j + \beta_k^Z + \beta_i$ to
treat X as a nominal factor. The corresponding tests compare these to the model with linear
predictor $\alpha_j + \beta_k^Z$. That model is not exactly equivalent to the conditional independence
model (8.15), which is the last model listed in the table, with $G^2 = 26.85$ based on df = 8.

Testing conditional independence with the first cumulative logit model yields likelihood-ratio
statistic $26.93 - 10.36 = 16.57$ with df = $9 - 8 = 1$, strong evidence of an effect.
Models  that  treat  either  or  both  variables  as  nominal  also  provide  strong  evidence,  but  not 
quite  as  strong.  Focusing  the  test  on  a  linear  trend  alternative  yields  a  smaller  P-value  when 
that  model  describes  reality  reasonably  well.  However,  we  learn  more  from  estimating 
model  parameters  than  from  these  significance  tests. 
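
As an informal check on the arithmetic behind Table 8.10, the following Python sketch recomputes each likelihood-ratio statistic as a difference of deviances and converts it to a chi-squared P-value. The helper `chi2_sf` and the dictionary layout are illustrative names of ours, not part of any package, and the inputs are the rounded deviances printed in the table, so a statistic can differ from the tabled value in the last digit (19.52 rather than 19.53 for the nominal-ordinal case).

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-squared probability for a positive integer df (illustrative helper)."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided standard normal tail plus a finite series in sqrt(x)
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total

# (opinion, religion): (G2, df without religion effect, G2, df with religion effect)
fits = {
    ("ordinal", "ordinal"): (26.93, 9, 10.36, 8),
    ("ordinal", "nominal"): (26.93, 9, 9.17, 7),
    ("nominal", "ordinal"): (26.85, 8, 7.33, 6),
    ("nominal", "nominal"): (26.85, 8, 6.58, 4),
}

lr_tests = {}
for key, (g2_null, df_null, g2_alt, df_alt) in fits.items():
    stat, df = g2_null - g2_alt, df_null - df_alt
    lr_tests[key] = (stat, df, chi2_sf(stat, df))
    print(key, round(stat, 2), df, format(lr_tests[key][2], ".5f"))
```

The test that spends a single degree of freedom on the linear trend gives the smallest P-value here, in line with the remark above about focusing the test on a narrower alternative.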


8.4.3 Generalized Cochran-Mantel-Haenszel Tests for I × J × K Tables

The CMH statistic generalizes to multiple rows and columns. The tests treat X and Y
symmetrically, so the three cases correspond to treating both as nominal, both as ordinal,
or one of each. Conditional on row and column totals, each stratum has $(I - 1)(J - 1)$
nonredundant cell counts. Let

$$\mathbf{n}_k = (n_{11k}, n_{12k}, \ldots, n_{1,J-1,k}, \ldots, n_{I-1,J-1,k})^T.$$

Let $\boldsymbol{\mu}_k = E(\mathbf{n}_k)$ under $H_0$: conditional independence, namely,

$$\boldsymbol{\mu}_k = (n_{1+k} n_{+1k}, \; n_{1+k} n_{+2k}, \; \ldots, \; n_{I-1,+,k} n_{+,J-1,k})^T / n_{++k}.$$

Let $\mathbf{V}_k$ denote the null covariance matrix of $\mathbf{n}_k$, conditional on the margins, where

$$\text{cov}(n_{ijk}, n_{i'j'k}) = \frac{n_{i+k}(\delta_{ii'} n_{++k} - n_{i'+k}) \, n_{+jk}(\delta_{jj'} n_{++k} - n_{+j'k})}{n_{++k}^2 (n_{++k} - 1)},$$

with $\delta_{ab} = 1$ when a = b and $\delta_{ab} = 0$ otherwise.

First, suppose the rows and columns are unordered. Let

$$\mathbf{n} = \sum_k \mathbf{n}_k, \quad \boldsymbol{\mu} = \sum_k \boldsymbol{\mu}_k, \quad \mathbf{V} = \sum_k \mathbf{V}_k.$$

The generalized CMH statistic for nominal X and Y is

$$\text{CMH} = (\mathbf{n} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{n} - \boldsymbol{\mu}). \tag{8.18}$$

Its large-sample chi-squared distribution has df = $(I - 1)(J - 1)$. The df value equals that
for the statistics comparing logistic models (8.15) and (8.16). For K = 1 stratum with n
observations, CMH = $[(n - 1)/n]X^2$, where $X^2$ is the Pearson statistic (3.10) for testing
independence.
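
The construction of (8.18) can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`generalized_cmh`, `pearson_x2`, and `solve` are hypothetical helpers, and the small Gaussian-elimination routine stands in for a library linear solver); it is not a substitute for statistical software. The data are the two strata of Table 8.9, for which the SAS output in Table 8.11 reports 19.76 for the general association alternative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def generalized_cmh(strata):
    """Statistic (8.18) for a list of I x J tables, each given as a list of rows."""
    I, J = len(strata[0]), len(strata[0][0])
    cells = [(i, j) for i in range(I - 1) for j in range(J - 1)]  # nonredundant cells
    d = [0.0] * len(cells)                    # n - mu, summed over strata
    V = [[0.0] * len(cells) for _ in cells]   # sum of the V_k
    for t in strata:
        row = [sum(r) for r in t]
        col = [sum(c) for c in zip(*t)]
        n = sum(row)
        for a, (i, j) in enumerate(cells):
            d[a] += t[i][j] - row[i] * col[j] / n
            for b, (i2, j2) in enumerate(cells):
                V[a][b] += (row[i] * ((i == i2) * n - row[i2])
                            * col[j] * ((j == j2) * n - col[j2])
                            / (n ** 2 * (n - 1)))
    return sum(di * xi for di, xi in zip(d, solve(V, d)))

def pearson_x2(t):
    """Pearson chi-squared statistic for independence in a single two-way table."""
    row = [sum(r) for r in t]
    col = [sum(c) for c in zip(*t)]
    n = sum(row)
    return sum((t[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(len(row)) for j in range(len(col)))

# Table 8.9: rows = religion (fundamentalist, moderate, liberal),
# columns = opinion (agree, neutral, disagree)
high_school = [[6, 2, 10], [8, 3, 9], [11, 5, 6]]
college = [[4, 2, 11], [21, 3, 5], [22, 4, 1]]

cmh = generalized_cmh([high_school, college])
print("general association CMH:", round(cmh, 2), "df =", (3 - 1) * (3 - 1))
```

Calling `generalized_cmh` on a single stratum gives a direct check of the K = 1 identity stated above, since the result should agree with $[(n - 1)/n]X^2$ computed from `pearson_x2`.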

Next, suppose the rows and columns are both ordered. For ordered scores $\{u_i\}$ and $\{v_j\}$,
evidence of a positive trend occurs if in each stratum $T_k = \sum_i \sum_j u_i v_j n_{ijk}$ exceeds its null
expectation. Given the marginal totals, under conditional independence

$$E(T_k) = \Bigl[\sum_i u_i n_{i+k}\Bigr]\Bigl[\sum_j v_j n_{+jk}\Bigr] \Big/ n_{++k},$$

$$\text{var}(T_k) = \frac{1}{n_{++k} - 1}\Bigl[\sum_i u_i^2 n_{i+k} - \frac{(\sum_i u_i n_{i+k})^2}{n_{++k}}\Bigr]\Bigl[\sum_j v_j^2 n_{+jk} - \frac{(\sum_j v_j n_{+jk})^2}{n_{++k}}\Bigr].$$

The statistic $[T_k - E(T_k)]/\sqrt{\text{var}(T_k)}$ equals the correlation between X and Y in stratum k
multiplied by $\sqrt{n_{++k} - 1}$. To summarize across the K strata in a way that is sensitive to a
correlation of common sign in each stratum, Mantel (1963) proposed

$$M^2 = \frac{\bigl\{\sum_k \bigl[\sum_i \sum_j u_i v_j n_{ijk} - E\bigl(\sum_i \sum_j u_i v_j n_{ijk}\bigr)\bigr]\bigr\}^2}{\sum_k \text{var}\bigl(\sum_i \sum_j u_i v_j n_{ijk}\bigr)}. \tag{8.19}$$

This has a large-sample $\chi^2_1$ null distribution, the same as for testing $H_0$: $\beta = 0$ in ordinal
model (8.17). For K = 1, this is the $M^2$ correlation-based statistic (3.16).

Table 8.11 Output (from SAS, PROC FREQ) for Generalized Cochran-Mantel-Haenszel
Tests with Data from Table 8.9

Summary Statistics for opinion by religious fundamentalism
Controlling for education

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic    Alternative Hypothesis     DF    Value    Prob
1            Nonzero Correlation         1    16.83    <0.0001
2            Row Mean Scores Differ      2    17.94    0.0001
3            General Association         4    19.76    0.0006

Landis et al. (1978) presented a statistic that has (8.18) and (8.19) as special cases. Their
statistic also can treat X as nominal and Y as ordinal, summarizing information about how
I row means compare to their null expected values, with df = $I - 1$ (see Note 8.9).


8.4.4  Example:  Homosexual  Marriage  Revisited 

Table 8.11 shows output for conducting generalized CMH tests with Table 8.9. Statistics
treating a variable as ordinal used scores (1, 2, 3) for opinion and for religious beliefs.

The general association alternative treats X and Y as nominal and uses (8.18). It is
sensitive to any association that is similar in each category of Z. The nonzero correlation
alternative treats X and Y as ordinal and uses (8.19). It is sensitive to a similar linear trend
in each category of Z. The row mean scores differ alternative treats rows as nominal and
columns as ordinal. It is sensitive to variation among the I row mean scores on Y, when
that variation is similar in each category of Z.
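
The nonzero correlation statistic in Table 8.11 can be reproduced directly from (8.19). The Python sketch below uses our own function name `mantel_m2` and the scores (1, 2, 3) for both variables, as in the SAS analysis; it is meant only to make the arithmetic of the formula concrete.

```python
import math

def mantel_m2(strata, u, v):
    """Mantel's statistic (8.19) for a list of I x J tables with row scores u, column scores v."""
    num, den = 0.0, 0.0
    for t in strata:
        row = [sum(r) for r in t]
        col = [sum(c) for c in zip(*t)]
        n = sum(row)
        # T_k = sum_i sum_j u_i v_j n_ijk
        t_k = sum(u[i] * v[j] * t[i][j] for i in range(len(u)) for j in range(len(v)))
        su = sum(ui * r for ui, r in zip(u, row))
        sv = sum(vj * c for vj, c in zip(v, col))
        ss_u = sum(ui ** 2 * r for ui, r in zip(u, row)) - su ** 2 / n
        ss_v = sum(vj ** 2 * c for vj, c in zip(v, col)) - sv ** 2 / n
        num += t_k - su * sv / n             # T_k - E(T_k)
        den += ss_u * ss_v / (n - 1)         # var(T_k)
    return num ** 2 / den

# Table 8.9: rows = religion (fundamentalist, moderate, liberal),
# columns = opinion (agree, neutral, disagree)
high_school = [[6, 2, 10], [8, 3, 9], [11, 5, 6]]
college = [[4, 2, 11], [21, 3, 5], [22, 4, 1]]

m2 = mantel_m2([high_school, college], u=(1, 2, 3), v=(1, 2, 3))
p_value = math.erfc(math.sqrt(m2 / 2.0))     # chi-squared upper tail with df = 1
print(round(m2, 2), p_value)
```

With these data the statistic rounds to 16.83, matching the first line of Table 8.11, and the P-value falls below 0.0001.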


8.4.5  Related  Score  Tests  for  Multinomial  Logit  Models 

The generalized CMH tests seem to be non-model-based alternatives to the tests of Section
8.4.1 using multinomial logit models. However, a close connection exists between them. For
certain multinomial logit models, the generalized CMH tests are score tests of conditional
independence.

The generalized CMH test (8.18) that treats X and Y as nominal is the score test that
the $(I - 1)(J - 1)$ $\{\beta_{ij}\}$ parameters in model (8.16) equal 0. The generalized CMH test
using $M^2$ that treats X and Y as ordinal is the score test of $\beta = 0$ in model (8.17). For the
cumulative logit model, the equivalence has the same $\{x_i\}$ scores in the model as in $M^2$,
and the $\{v_j\}$ scores in $M^2$ are average rank scores. For the adjacent-categories logit model
analog of (8.17), the $\{v_j\}$ scores in $M^2$ are any equally spaced scores.

With large samples in each stratum, the generalized CMH tests give results similar to those of
likelihood-ratio tests comparing the relevant models. An advantage of the model-based
approach is providing estimates of effects. An advantage of the generalized CMH tests is
maintaining good performance under sparse asymptotics whereby K grows as n does. Also,
they are valid under randomization arguments when there is not multinomial sampling from
the population of interest but the multivariate hypergeometric distribution applies to each
stratum under the null, such as for a volunteer sample of subjects randomly assigned to
treatments in a clinical trial.

8.5 DISCRETE-CHOICE MODELS

Many  applications  of  multinomial  logit  models  relate  to  determining  effects  of  explanatory 
variables  on  a  subject's  choice  from  a  discrete  set  of  options — for  instance,  transportation 
system  to  take  to  work  (driving  alone,  carpooling,  bus,  subway,  walk,  bicycle),  housing 
(buy  house,  buy  condominium,  rent),  primary  shopping  location  (downtown,  mall,  catalogs, 
Internet),  or  product  brand.  Models  for  response  variables  consisting  of  a  discrete  set  of 
choices  are  called  discrete-choice  models. 


8.5.1  Conditional  Logits  for  Characteristics  of  the  Choices 

In  many  discrete-choice  applications,  an  explanatory  variable  takes  different  values  for 
different  response  choices.  As  predictors  of  choice  of  transportation  system,  the  cost  and 
the  time  to  reach  the  destination  take  different  values  for  each  option.  As  a  predictor  of 
choice  of  product  brand,  the  price  varies  according  to  the  option.  Explanatory  variables 
of  this  type  are  characteristics  of  the  choices.  They  differ  from  the  usual  ones,  for  which 
values remain constant across the choice set. Such variables, characteristics of the chooser,
include  demographic  characteristics  such  as  gender,  race,  and  educational  attainment. 

McFadden (1974) proposed a discrete-choice model for explanatory variables that are
characteristics of the choices. For subject i and response choice j, let $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijp})^T$
denote the values of the p explanatory variables, and let $\mathbf{x}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iJ})$. The model for
the probability of selecting option j is

$$\pi_j(\mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}^T \mathbf{x}_{ij})}{\sum_h \exp(\boldsymbol{\beta}^T \mathbf{x}_{ih})}. \tag{8.20}$$

For each pair of choices a and b, this model has the logit form for conditional probabilities,

$$\log[\pi_a(\mathbf{x}_i)/\pi_b(\mathbf{x}_i)] = \boldsymbol{\beta}^T(\mathbf{x}_{ia} - \mathbf{x}_{ib}). \tag{8.21}$$


Conditional on the choice being a or b, a variable's influence depends on the distance
between the subject’s values of that variable for those choices. If the values are the same,
the model asserts that the variable has no influence on the choice between a and b. Reflecting
this property, McFadden originally referred to model (8.20) as a conditional logit model.

From (8.21), the odds of choosing a over b do not depend on the other alternatives in the
choice set or on their values of the explanatory variables. Luce (1959) called this property
independence from irrelevant alternatives. It is unrealistic in some applications. For
instance, for travel options auto and red bus, suppose that 80% choose auto, corresponding to
an odds of 4.0. Now suppose that the options are auto, red bus, and blue bus. According to
(8.21), the odds are still 4.0 of choosing auto instead of red bus, but intuitively, we expect
them to be about 8.0 (if about 10% choose each bus option). McFadden (1974) stated:
"Application of the model should be limited to situations where the alternatives can plausibly
be assumed to be distinct and weighed independently in the eyes of each decision-maker."
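
The red bus and blue bus illustration can be worked through numerically. In the Python sketch below, `choice_probs` is our own small implementation of (8.20), and the single covariate (an indicator for auto) with coefficient log 4 is a hypothetical choice made only so that 80% choose auto when the bus is the sole alternative.

```python
import math

def choice_probs(beta, x):
    """Probabilities (8.20); x[j] is the covariate vector for option j in the choice set."""
    weights = [math.exp(sum(b * v for b, v in zip(beta, xj))) for xj in x]
    total = sum(weights)
    return [w / total for w in weights]

beta = [math.log(4.0)]                     # hypothetical coefficient of the auto indicator

two_options = choice_probs(beta, [[1.0], [0.0]])             # auto, red bus
three_options = choice_probs(beta, [[1.0], [0.0], [0.0]])    # auto, red bus, blue bus

print("auto share, two options:  ", round(two_options[0], 3))
print("auto share, three options:", round(three_options[0], 3))
print("odds auto vs red bus:     ", round(three_options[0] / three_options[1], 3))
```

Under the model, adding the blue bus lowers the auto share from 0.8 to about 0.667 while the auto versus red bus odds stay at 4.0, rather than rising toward the value of about 8.0 that intuition suggests when the two buses are near-perfect substitutes.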

McFadden's model is actually a bit more general, permitting the choice set to vary
among subjects. For instance, some subjects may not have the subway as an option for
travel to work. In the denominator of (8.20), the sum is then taken over the choice set for
subject i.

8.5.2 Multinomial Logit Model Expressed as Discrete-Choice Model

Discrete-choice models can also incorporate explanatory variables that are characteristics
of the chooser. This may seem surprising, since formula (8.20) has a single parameter for
each explanatory variable; that is, the parameter vector is the same for each pair of choices.
However, multinomial logit model (8.2) has this discrete-choice form when we replace such
an explanatory variable by J artificial variables. The jth is the product of the explanatory
variable with an indicator variable that equals 1 when the response choice is j. For instance,
for a single explanatory variable, let $x_i$ denote its value for subject i. For $j = 1, \ldots, J$, let
$\delta_{jk}$ equal 1 when k = j and 0 otherwise, and let

$$\mathbf{z}_{ij} = (\delta_{j1}, \ldots, \delta_{jJ}, \delta_{j1} x_i, \ldots, \delta_{jJ} x_i)^T.$$

Let $\boldsymbol{\beta} = (\alpha_1, \ldots, \alpha_J, \beta_1, \ldots, \beta_J)^T$. Then $\boldsymbol{\beta}^T \mathbf{z}_{ij} = \alpha_j + \beta_j x_i$, and (8.2) is (with $\alpha_J = \beta_J = 0$
for identifiability)

$$\pi_j(x_i) = \frac{\exp(\alpha_j + \beta_j x_i)}{\exp(\alpha_1 + \beta_1 x_i) + \cdots + \exp(\alpha_J + \beta_J x_i)} = \frac{\exp(\boldsymbol{\beta}^T \mathbf{z}_{ij})}{\exp(\boldsymbol{\beta}^T \mathbf{z}_{i1}) + \cdots + \exp(\boldsymbol{\beta}^T \mathbf{z}_{iJ})}.$$

This has the discrete-choice model form (8.20).

With  this  approach,  discrete-choice  models  can  contain  characteristics  of  the  chooser 
and  of  the  choices.  Thus,  the  model  is  very  general.  The  ordinary  multinomial  logit  model 
using  baseline-category  logits  is  a  special  case. 
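
To see the artificial-variable construction at work, the following Python sketch builds $\mathbf{z}_{ij}$ for a single chooser characteristic and confirms that the discrete-choice form reproduces the baseline-category probabilities. The parameter values, the covariate value, and the helper names `z_vector` and `softmax` are hypothetical and chosen only for illustration.

```python
import math

def z_vector(j, x, J):
    """Artificial variables z_ij: J choice indicators followed by J indicator-by-x products."""
    delta = [1.0 if k == j else 0.0 for k in range(J)]
    return delta + [d * x for d in delta]

def softmax(scores):
    """Normalize exponentiated linear predictors to probabilities."""
    weights = [math.exp(s) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

J = 3
alpha = [0.5, -0.2, 0.0]      # hypothetical intercepts, last set to 0 for identifiability
slope = [1.0, 0.3, 0.0]       # hypothetical effects of x, last set to 0 for identifiability
beta = alpha + slope          # stacked parameter vector (alpha_1..alpha_J, beta_1..beta_J)
x_i = 1.7                     # hypothetical covariate value for subject i

# Baseline-category form: linear predictor alpha_j + beta_j * x_i
baseline_form = softmax([alpha[j] + slope[j] * x_i for j in range(J)])

# Discrete-choice form (8.20): linear predictor beta' z_ij
choice_form = softmax([sum(b * z for b, z in zip(beta, z_vector(j, x_i, J)))
                       for j in range(J)])

print(baseline_form)
print(choice_form)
```

The two printed vectors agree, because $\boldsymbol{\beta}^T \mathbf{z}_{ij}$ picks out exactly $\alpha_j + \beta_j x_i$.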

8.5.3  Example:  Shopping  Destination  Choice 

McFadden (1974) used discrete-choice models to describe how residents of Pittsburgh,
Pennsylvania, chose a shopping destination. The five possible destinations were different
city zones. One explanatory variable measured S = shopping opportunities, defined to be
the retail employment in the zone as a percentage of total retail employment in the region.
The other explanatory variable was P = price of the trip, defined from a separate analysis
using auto in-vehicle time and auto operating cost.

The ML estimates of model parameters were -1.06 (SE = 0.28) for price of trip and
0.84 (SE = 0.23) for shopping opportunity. From (8.21),

$$\log(\pi_a/\pi_b) = -1.06(P_a - P_b) + 0.84(S_a - S_b).$$

Not  surprisingly,  a  destination  is  relatively  more  attractive  as  the  trip  price  decreases  and 
as  the  shopping  opportunity  increases. 
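
A brief Python sketch makes the size of these effects concrete by converting the fitted equation into odds. The function name `log_odds` and the one-unit differences are our own illustrative choices; the units of P and S are those of McFadden's analysis, which the text does not spell out.

```python
import math

PRICE_COEF = -1.06         # ML estimate for price of trip, from the text
OPPORTUNITY_COEF = 0.84    # ML estimate for shopping opportunity, from the text

def log_odds(p_a, s_a, p_b, s_b):
    """Fitted log odds of choosing destination a rather than b, from (8.21)."""
    return PRICE_COEF * (p_a - p_b) + OPPORTUNITY_COEF * (s_a - s_b)

# Destination a costs one unit more than b, with equal shopping opportunities
odds_price = math.exp(log_odds(1.0, 0.0, 0.0, 0.0))
# Destination a has one more unit of shopping opportunity, at equal price
odds_opportunity = math.exp(log_odds(0.0, 1.0, 0.0, 0.0))

print(round(odds_price, 3), round(odds_opportunity, 3))
```

A one-unit price disadvantage multiplies the estimated odds by about 0.35, and a one-unit opportunity advantage multiplies them by about 2.3.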


8.5.4  Multinomial  Probit  Discrete-Choice  Models 

Let $U_{ij}$ denote the utility of alternative j for subject i. Suppose that

$$U_{ij} = \boldsymbol{\beta}^T \mathbf{x}_{ij} + \epsilon_{ij} \tag{8.22}$$

and the response choice is the value of j having maximum utility. McFadden (1974)
showed that the assumption that $\{\epsilon_{ij}\}$ are independent and have the standard extreme value
distribution is equivalent to discrete-choice model (8.20). (Recall Note 7.2 and Section
8.1.6.)
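
McFadden's equivalence can be checked informally by simulation. The Python sketch below draws independent standard extreme value (Gumbel) errors by the inverse-cdf method, records which alternative has maximum utility, and compares the simulated choice shares with the probabilities (8.20). The linear predictor values in `eta`, the number of draws, and the function names are hypothetical choices for illustration.

```python
import math
import random

def simulate_choice_shares(eta, draws, seed=0):
    """Share of draws in which each alternative has maximum utility eta_j + Gumbel error."""
    rng = random.Random(seed)
    counts = [0] * len(eta)
    for _ in range(draws):
        # Standard extreme value draw by inverse cdf: -log(-log(U)), U uniform on (0, 1)
        utils = [e - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for e in eta]
        counts[utils.index(max(utils))] += 1
    return [c / draws for c in counts]

def logit_probs(eta):
    """Choice probabilities (8.20) for linear predictor values eta_j = beta' x_ij."""
    weights = [math.exp(e) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

eta = [1.0, 0.0, -0.5]     # hypothetical values of beta' x_ij for three alternatives
shares = simulate_choice_shares(eta, draws=100_000)
probs = logit_probs(eta)

for s, p in zip(shares, probs):
    print(round(s, 3), round(p, 3))
```

The simulated shares should agree with (8.20) up to Monte Carlo error, which is on the order of 0.002 with this many draws.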

One way such a construction may be unrealistic is when the error terms partly represent
unobserved covariates that are correlated with the response variable. Then, $\epsilon_{ia}$ and $\epsilon_{ib}$ are
unlikely to be independent. However, this utility structure suggests alternative models, in
particular, ones that do not have the property of independence from irrelevant alternatives.
When we assume that $\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \ldots, \epsilon_{iJ})$ has a multivariate normal $N(\mathbf{0}, \boldsymbol{\Sigma})$ distribution, this
utility model is a multinomial probit model, extending the model of Section 8.1.6. Model
identifiability requires constraints on $\boldsymbol{\Sigma}$, such as by taking $\text{var}(\epsilon_{i1}) = 1$ (Hausman and Wise
1978). Multinomial probit models are more complex computationally, requiring numerical
integration or simulation to obtain the likelihood function.

8.5.5  Extensions:  Nested  Logit  and  Mixed  Logit  Models 

In permitting correlated errors among response categories in a model for utilities, we could
instead assume that $\boldsymbol{\epsilon}_i$ has a multivariate form of extreme value distribution. This induces
generalized  logistic  models.  For  example,  McFadden  considered  applications  in  which  the 
choice  categories  are  partitioned  into  groups  having  a  tree-like  structure,  with  each  group 
consisting  of  similar  alternatives  and  having  correlated  error  terms  within  groups.  This  is 
useful  when  the  choices  are  naturally  nested.  An  example  is  a  person’s  choice  of  where  to 
live:  The  person  first  chooses  one  of  several  communities  to  live  in,  and  then  within  that 
community  chooses  a  type  of  dwelling.  Such  a  model  for  nested  choices  is  called  a  nested 
logit  model.  Train  (2009,  pp.  77-88)  gave  an  overview  and  multiple  references. 

Multinomial logit and probit discrete-choice models can be further generalized by treating
certain effects as random rather than fixed, in the spirit of models considered later in
this text in Chapters 12 and 13. A mixed logit model is one in which choice probabilities are
obtained by integrating the logistic expression (8.20) for choice probabilities with respect
to a distribution for certain model parameters. This allows heterogeneity among subjects
in the size of effects. It is useful as a mechanism for inducing positive association among
repeated responses with longitudinal data. Estimates of the parameters of the mixing
distribution provide information about the average effects and the extent of the heterogeneity.
Individual effects can also be predicted. For details, see McFadden (1974), Skrondal and
Rabe-Hesketh (2004, Chap. 13), and Train (2009, Chap. 6).

8.5.6  Extensions:  Discrete  Choice  with  Ordered  Categories 

Sometimes  the  response  categories  have  a  natural  ordering,  such  as  the  choice  in  renting 
a  car  among  (subcompact,  compact,  midsize,  large)  size  levels.  Standard  discrete-choice 
models  do  not  account  for  such  ordering.  For  multinomial  logit  models,  the  property  of 
independence  from  irrelevant  alternatives  may  then  be  especially  unrealistic,  as  a  particular 
response  category  is  more  similar  to  categories  near  it  than  categories  further  away. 

Small (1987) proposed a model related to McFadden’s multivariate extreme value model
for utilities. In his model, the correlation between utility components for alternatives a and
b is a nonincreasing function of $|a - b|$. Another approach uses a multinomial probit model
with structure on the covariance matrix for $\boldsymbol{\epsilon}_i$ that reflects the ordinality. For example, the
correlation might have the autoregressive structure whereby $\text{corr}(\epsilon_{ia}, \epsilon_{ib}) = \rho^{|a - b|}$.

Beggs et al. (1981) considered an alternative type of ordered-alternatives problem in
which subjects fully rank the outcome categories from best to worst. The categories need
not themselves be ordered. They assumed the utility model (8.22), assuming iid extreme
value errors. Let $(r_{i1}, \ldots, r_{iJ})$ denote the ranking by subject i of the J choices, where $r_{i1}$
is the response category given the highest ranking and $r_{iJ}$ is the response category given
the lowest ranking. Based on convenient properties of conditional distributions for extreme
value distributions, they showed that the ranking vector for subject i has probability


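$$P(U_{i r_{i1}} > U_{i r_{i2}} > \cdots > U_{i r_{iJ}}) = \prod_{h=1}^{J-1} \left[ \exp(\boldsymbol{\beta}^T \mathbf{x}_{i r_{ih}}) \Big/ \sum_{m=h}^{J} \exp(\boldsymbol{\beta}^T \mathbf{x}_{i r_{im}}) \right].$$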


Summing  the  logs  of  these  terms  over  the  n  subjects  yields  the  multinomial  log  likelihood.  It 
can  be  maximized  using  Newton-Raphson  methods.  Beggs  et  al.  (1981)  applied  the  model 
to  data  in  which  various  car  types  were  ranked  and  the  explanatory  variables  included  car 
choice  characteristics  such  as  price,  fuel  cost,  whether  gas-powered  or  electric-powered, 
and  subject  socioeconomic  family  characteristics. 
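
The ranking probability has a sequential interpretation: the top-ranked category is chosen from all J by a conditional logit, the second from the remaining J - 1, and so on. The Python sketch below implements that product and the resulting log-likelihood contribution. The names `ranking_prob` and `log_likelihood` and the values in `eta` are hypothetical, with `eta[j]` standing for $\boldsymbol{\beta}^T \mathbf{x}_{ij}$.

```python
import math
from itertools import permutations

def ranking_prob(eta, ranking):
    """Probability of a full ranking (best first), given eta[j] = beta' x_ij."""
    prob = 1.0
    for h in range(len(ranking) - 1):
        remaining = ranking[h:]            # categories not yet ranked at stage h
        prob *= math.exp(eta[ranking[h]]) / sum(math.exp(eta[m]) for m in remaining)
    return prob

def log_likelihood(eta_by_subject, rankings):
    """Sum over subjects of the log ranking probabilities."""
    return sum(math.log(ranking_prob(eta, r)) for eta, r in zip(eta_by_subject, rankings))

eta = [0.8, 0.0, -0.4, 0.3]                # hypothetical linear predictors, J = 4
total = sum(ranking_prob(eta, r) for r in permutations(range(len(eta))))
print("sum over all 24 rankings:", round(total, 12))

print(log_likelihood([eta, eta], [(0, 3, 1, 2), (3, 0, 2, 1)]))
```

Summing the probabilities over all J! rankings gives 1, which is a useful sanity check on any implementation of the formula.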


8.6  BAYESIAN  MODELING  OF  MULTINOMIAL  RESPONSES 

The  Bayesian  approach  for  binary  regression  models  extends  to  multinomial  models.  We 
focus  here  on  Bayesian  fitting  of  cumulative  link  models  for  ordinal  responses  and  of 
multinomial  (baseline-category)  logit  and  probit  models  for  nominal  responses. 

8.6.1  Bayesian  Fitting  of  Cumulative  Link  Models 

For an ordinal response Y, many models are special cases of the cumulative link model,

$$G^{-1}[P(Y \le j \mid \mathbf{x})] = \alpha_j - \boldsymbol{\beta}^T \mathbf{x}.$$

From Section 8.2.3, this model is implied by a regression model for a latent variable having
cdf G, such as logistic for the logit link. Prior distributions for the cutpoint parameters
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{c-1})$ should take into account the ordering constraint

$$-\infty < \alpha_1 < \alpha_2 < \cdots < \alpha_{c-1} < \infty.$$

In the cumulative probit case, the latent response for observation i is

$$Y_i^* = \boldsymbol{\beta}^T \mathbf{x}_i + \epsilon_i,$$

where $\{\epsilon_i\}$ are independent N(0, 1). Albert and Chib (1993) presented a Bayesian analysis
that utilizes the latent variable model and extends the analysis of Section 7.2.6 for binary
responses. This model is simpler to handle than the cumulative logit model, because
results apply from Bayesian inference for ordinary normal linear regression models, with a
multivariate normal prior distribution for the regression parameters and independent normal
latent variables $\mathbf{y}^* = (y_1^*, \ldots, y_n^*)^T$. Implementation of MCMC methods is relatively simple
because the Monte Carlo sampling is from normal distributions.

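A Gibbs sampling scheme determines the posterior distribution by successively sampling
from the density of (1) $\mathbf{y}^*$ given $\mathbf{y}$, $\boldsymbol{\beta}$, and $\boldsymbol{\alpha}$, (2) $\boldsymbol{\beta}$ given $\mathbf{y}$, $\mathbf{y}^*$, and $\boldsymbol{\alpha}$, and (3) $\boldsymbol{\alpha}$ given
$\mathbf{y}$, $\mathbf{y}^*$, and $\boldsymbol{\beta}$. It uses the fact that if $y_i = j$, then $y_i^*$ is between $\alpha_{j-1}$ and $\alpha_j$. For example,
given $y_i$ and $\boldsymbol{\alpha}$, the conditional density function of $y_i^*$ is normal with mean $\boldsymbol{\beta}^T \mathbf{x}_i$ and
variance 1 but truncated between the two cutpoints corresponding to the value of $y_i$. Since
$\mathbf{y}^*$ determines $\mathbf{y}$, the conditional density of $\boldsymbol{\beta}$ given $\mathbf{y}$, $\mathbf{y}^*$, and $\boldsymbol{\alpha}$ is proportional to the prior
of $\boldsymbol{\beta}$ times the density of $\mathbf{y}^*$ given $\boldsymbol{\beta}$, which is normal since both components are normal.
The conditional density function of $\boldsymbol{\alpha}$ given $\mathbf{y}$, $\mathbf{y}^*$, and $\boldsymbol{\beta}$ is proportional to its truncated
normal prior density but truncated to reflect that $\alpha_j$ must fall above all $y_i^*$ such that $y_i = j$
but below all $y_i^*$ such that $y_i = j + 1$.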
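
Step (1) of this scheme, the draw of the latent variables, can be sketched in Python with the standard library. This is only one piece of a sampler, written under our own naming (`sample_latent`) and with an inverse-cdf method for the truncated normal draw; it is not Albert and Chib's code, and steps (2) and (3) are omitted. Categories are coded 1, ..., c and the values of `eta` stand for the current $\boldsymbol{\beta}^T \mathbf{x}_i$.

```python
import random
from statistics import NormalDist

def sample_latent(y, eta, cutpoints, rng):
    """Draw y*_i from N(eta_i, 1) truncated to (alpha_{y_i - 1}, alpha_{y_i}).

    y: observed categories coded 1..c; cutpoints: [alpha_1, ..., alpha_{c-1}].
    """
    std = NormalDist()
    bounds = [float("-inf")] + list(cutpoints) + [float("inf")]
    draws = []
    for y_i, mean in zip(y, eta):
        lower = std.cdf(bounds[y_i - 1] - mean)    # cdf at the lower cutpoint
        upper = std.cdf(bounds[y_i] - mean)        # cdf at the upper cutpoint
        u = lower + (upper - lower) * rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
        draws.append(mean + std.inv_cdf(u))
    return draws

rng = random.Random(1)
y = [1, 2, 3, 2]                     # hypothetical observed categories (c = 3)
eta = [0.2, -0.1, 0.5, 1.0]          # hypothetical current values of beta' x_i
cutpoints = [-0.5, 0.8]              # hypothetical current alpha_1 < alpha_2

latent = sample_latent(y, eta, cutpoints, rng)
print([round(v, 3) for v in latent])
```

Each draw falls in the interval determined by the observed category, so an observation with $y_i = 2$ always yields a latent value between the two cutpoints.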

Albert and Chib generalized the model to use link functions based on inverse cdf’s for
the t distribution. Since the logistic distribution relates closely to the t distribution with df
= 8 (as described in Sections 4.2.5 and 7.2.6), this also provides a relatively simple way of
fitting corresponding cumulative logit models. Alternatively, these days it is straightforward
to use MCMC directly with the product of the chosen prior densities and the multinomial
likelihood for the chosen model, regardless of the link function.

8.6.2  Example:  Cannabis  Use  and  Mother’s  Age 

Table 8.12 comes from a 21-year follow-up study of mothers and their children who received
antenatal  care  at  a  public  hospital  in  Brisbane,  Australia.  At  the  age  of  21,  the  children 
were  asked  “In  the  last  month,  how  often  did  you  use  cannabis,  marijuana,  pot,  etc.?”  One 
explanatory  variable  was  the  mother’s  age  at  entry  to  the  study. 

The deviance statistic for testing goodness of fit of the independence model in this table
is $G^2 = 7.71$ (df = 4, P = 0.10). There is not much evidence of association, but this
statistic ignores the ordinality of the response. Let’s consider the cumulative logit model of
proportional odds form,

$$\text{logit}\, P(Y \le j) = \alpha_j + \beta x,$$

with x = 1 for age > 20 and x = 0 otherwise. The deviance is now 3.18 (df = 3). The
ML estimate $\hat{\beta} = 0.230$ (SE = 0.107) and likelihood-ratio statistic of 4.53 (df = 1, P =
0.033) for testing $H_0$: $\beta = 0$ show considerable evidence that cannabis use tended to be
lower when mother’s age was higher. The 95% profile likelihood confidence interval for $\beta$
is (0.018, 0.441).
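
The deviance for the independence model quoted above is easy to verify from the counts in Table 8.12. The Python sketch below computes $G^2 = 2 \sum n_{ij} \log(n_{ij}/\hat{\mu}_{ij})$ and its P-value; the closed-form tail probability used here is specific to df = 4, and the variable names are ours.

```python
import math

# Table 8.12: rows = mother's age (<20, >20), columns = the five cannabis-use categories
table = [[154, 91, 42, 27, 30],
         [1078, 567, 261, 157, 111]]

row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

# Likelihood-ratio statistic against fitted values row_i * col_j / n
g2 = 2.0 * sum(table[i][j] * math.log(table[i][j] * n / (row[i] * col[j]))
               for i in range(len(row)) for j in range(len(col)))
df = (len(row) - 1) * (len(col) - 1)

# Chi-squared upper tail for df = 4: exp(-x/2) * (1 + x/2)
p_value = math.exp(-g2 / 2.0) * (1.0 + g2 / 2.0)

print(round(g2, 2), df, round(p_value, 2))
```

The result agrees with the reported $G^2 = 7.71$ on df = 4 with P = 0.10.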


Table 8.12 Cannabis Use at 21 Years by Mother’s Age at Study Entry

                                       Cannabis Use at 21 Years
Mother’s Age    Never Use    Not Last Month    Once Last Month    Every Few Days    Every Day
<20 years             154                91                 42                27           30
>20 years            1078               567                261               157          111

Source: Hayatbakhsh et al. Am. J. Drug & Alcohol Abuse, 36: 350-356, 2010.

For a Bayesian analysis we recode x to take values 0.5 and -0.5 instead of 1 and
0, so the cumulative logits in each row have the same prior variability. For an analysis
with uninformative priors, we used independent normal priors for the model parameters
(appropriately truncated for $\{\alpha_j\}$) with means of 0 and standard deviations of 10. The
posterior distribution of $\beta$, based on a Markov chain of length one million, then has a mean
of 0.229 and a standard deviation of 0.108. The equal-tail 95% posterior interval for $\beta$ is
(0.018, 0.441), and the posterior $P(\beta < 0) = 0.017$. The sample size is large, so results are
similar to those obtained with the frequentist approach. The Bayesian posterior $P(\beta < 0)$
is comparable to a frequentist one-sided P-value for $H_a$: $\beta > 0$.
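
The closeness of the two summaries can be illustrated with a rough calculation. The Python sketch below treats the posterior of $\beta$ as approximately normal with the reported mean and standard deviation, which is our simplifying assumption (the value 0.017 in the text comes from the MCMC sample itself), and compares the resulting tail probability with a one-sided Wald P-value built from the ML estimate and its standard error.

```python
from statistics import NormalDist

# Reported posterior summaries for beta (normal shape assumed here for illustration)
post_mean, post_sd = 0.229, 0.108
approx_post_prob = NormalDist(post_mean, post_sd).cdf(0.0)   # approximate P(beta < 0)

# Reported ML estimate and standard error
ml_estimate, ml_se = 0.230, 0.107
wald_one_sided_p = 1.0 - NormalDist().cdf(ml_estimate / ml_se)

print(round(approx_post_prob, 3), round(wald_one_sided_p, 3))
```

Both values are close to 0.017, and also close to half of the two-sided likelihood-ratio P-value of 0.033 reported earlier.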

8.6.3  Bayesian  Fitting  of  Multinomial  Logit  and  Probit  Models 

For  nominal-scale  responses,  Albert  and  Chib  (1993)  presented  Bayesian  fitting  of  the 
multinomial  probit  model,  using  the  connection  with  the  latent  utility  model  for  maxima  of 
normal  random  variates  outlined  in  Section  8.1.6.  We  discuss  this  in  terms  of  the  discrete- 
choice  form  of  the  model  outlined  in  Section  8.5.4,  since  standard  models  that  do  not  have 
characteristics  of  the  choices  as  explanatory  variables  are  special  cases.  The  underlying 
model  for  the  utility  for  subject  i  making  response  j  is 

$$U_{ij} = \boldsymbol{\beta}^T \mathbf{x}_{ij} + \epsilon_{ij}$$


and  the  response  choice  is  the  value  of  j  having  maximum  utility. 

Let $U_i = (U_{i1}, \ldots, U_{iJ})$. When the errors are independent standard normal and we
use a diffuse normal prior for $\beta$, a Gibbs sampling scheme approximates the posterior
distribution by successively sampling from the normal conditional densities of (1) $\beta$ given
$y, U_1, \ldots, U_n$, and (2) $U_1, \ldots, U_n$ given $y$ and $\beta$. In case (2) the distribution is truncated
to reflect that if $y_i = j$ then component $j$ of $U_i$ is its maximum. The model can also extend
to let the utility components be correlated by introducing a parameter $\theta$ for the covariance
matrix, such as a common correlation. A third step of the Gibbs sampling then includes
sampling from its conditional density.
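
Step (2) of this scheme draws latent utilities subject to the constraint that the observed choice has the largest utility. A minimal sketch of that single step follows. It uses simple rejection sampling, which is an assumption made here for brevity and not the sampler described in the cited papers; `draw_utilities` and its arguments are hypothetical names.

```python
import random

def draw_utilities(means, chosen, rng, max_tries=100000):
    """Draw independent unit-variance normal utilities, conditioned on
    component `chosen` being the maximum (rejection sampling).

    means: linear predictor values, one per response category
    chosen: index of the observed response category
    """
    for _ in range(max_tries):
        u = [rng.gauss(m, 1.0) for m in means]
        # Accept the draw only if it respects the observed choice.
        if all(u[chosen] >= uj for uj in u):
            return u
    raise RuntimeError("no accepted draw; the constraint is too unlikely")

rng = random.Random(1)
draws = [draw_utilities([0.2, -0.1, 0.4], chosen=1, rng=rng) for _ in range(500)]
```

Rejection sampling becomes slow when the observed category is unlikely under the current value of the parameters, so practical implementations often update one utility component at a time from a univariate truncated normal instead.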

McCulloch  et  al.  (2000)  also  dealt  with  multinomial  probit  models  in  terms  of  the 
underlying latent model. They noted the difficulty in placing priors on a covariance matrix
that incorporate an identifiability constraint such as $\mathrm{var}(U_{i1}) = 1$, and proposed priors that
can account for that constraint. See also Imai and van Dyk (2005).

For  baseline-category  logit  models,  such  routines  do  not  connect  with  standard  ones 
for normal variables, because the utility construction uses errors with extreme value
distributions. However, it is not necessary to base computations on latent variable models or
on  the  general  discrete-choice  version  of  the  model,  and  with  normal  priors  for  the  model 
parameters,  software  is  widely  available.  With  relatively  diffuse  priors,  substantive  results 
are  usually  similar  to  those  with  corresponding  probit  models. 

Note,  however,  that  if  you  place  simple  structure  such  as  a  common  variance  for  the 
priors for $\beta_{1k}, \beta_{2k}, \ldots, \beta_{J-1,k}$ in model (8.1), posterior results then depend somewhat on
the choice of baseline category, because an effect relative to a pair of nonbaseline categories,
$\beta_{jk} - \beta_{j'k}$, then has twice the prior variance. Alternatively, you can overparameterize by
adding $\beta_{Jk}$ to the model with the same prior but focus on the posterior differences for
interpretation.  The  same  remark  applies  to  factors  in  such  models,  as  results  should  ideally 
be  invariant  to  the  choice  of  a  baseline  category  for  indicators.  One  way  to  do  this  is  to 
conduct  the  analysis  in  terms  of  corresponding  Poisson  loglinear  models,  introduced  in 
the next chapter, which need not identify a baseline category for any categorical variable
(Gelman et al. 2004, pp. 431-433). See www.stat.ufl.edu/~aa/cda/cda.html for
details in terms of the following example. For examples of Bayesian uses of such models,
see references cited in Note 8.12.

Table 8.13  Estimated Size Effects and Standard Errors in Multinomial Logistic Model for
Alligator Food Choice, Using Size and Lake as Predictors

                  Maximum Likelihood    Bayes, Prior σ = 100     Bayes, Prior σ = 1
Baseline Logit    Estimate    SE        Estimate    Std. Dev.    Estimate    Std. Dev.
log(π_I/π_F)        1.46      0.40        1.52        0.40         1.26        0.38
log(π_R/π_F)       -0.35      0.58       -0.39        0.60        -0.55        0.48
log(π_B/π_F)       -0.63      0.64       -0.68        0.67        -0.36        0.51
log(π_O/π_F)        0.33      0.45        0.35        0.46         0.23        0.43

I, invertebrate; R, reptile; B, bird; O, other; F, fish.
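
The dependence on the baseline category noted in Section 8.6.3 can be seen in a short calculation. It is sketched here under the assumption that the priors for the effects in model (8.1) are independent normal with common variance $\sigma^2$ and that category $J$ is the baseline, so that $\beta_{Jk} = 0$.

```latex
\begin{align*}
\operatorname{var}(\beta_{jk} - \beta_{Jk}) &= \operatorname{var}(\beta_{jk}) = \sigma^2
  && \text{(contrast with the baseline)}, \\
\operatorname{var}(\beta_{jk} - \beta_{j'k}) &= \operatorname{var}(\beta_{jk}) + \operatorname{var}(\beta_{j'k}) = 2\sigma^2
  && \text{(two nonbaseline categories, } j \neq j'\text{)}.
\end{align*}
```

If $\beta_{Jk}$ is instead added to the model with the same prior, every pairwise difference has prior variance $2\sigma^2$, so no category is treated differently.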


8.6.4  Example:  Alligator  Food  Choice  Revisited 

For the alligator food choice data introduced in Section 8.1.2, we found that the probability
of  selecting  a  particular  food  choice  was  described  well  by  a  model  with  additive  effects  of 
size  s  and  indicators  contrasting  lakes  Hancock,  Oklawaha,  and  Trafford  with  George.  For 
the  baseline  choice  of  fish,  the  model  is 


$$\log(\pi_j/\pi_F) = \alpha_j + \beta_{1j} s + \beta_{2j} z_H + \beta_{3j} z_O + \beta_{4j} z_T, \quad j = 1, 2, 3, 4.$$

As  there  was  little  prior  information,  especially  about  the  lake  effects,  we  fitted  the  model 
using diffuse independent normal prior distributions. Table 8.13 shows posterior means and
standard deviations for the size effect, when we parameterize in such a way that the 10
conditional log odds ratios relating size to pairs of food choice categories all have normal
distributions with $\mu = 0$ and $\sigma = 100$. Corresponding ML estimates and SE values are also
shown.  With  such  uninformative  priors,  results  are  quite  similar.  With  either  analysis,  we 
conclude  that  the  smaller  alligators  are  relatively  more  likely  to  have  invertebrates  as  their 
primary  food  choice. 

To  compare  with  results  from  a  highly  informative  Bayesian  analysis,  we  used  normal 
priors for the ten log odds ratios between size and pairs of food choices with each $\sigma = 1$.
Table 8.13 shows results. Having more prior information centered at 0 results in shrinkage
of  posterior  estimates  and  standard  deviations  toward  0. 
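
The shrinkage can be mimicked with a rough normal approximation. The sketch below is not the MCMC fit reported in Table 8.13. It assumes that the likelihood for a single log odds ratio is approximately normal, centered at the ML estimate with standard deviation equal to its SE, and combines it with a $N(0, \sigma^2)$ prior. The function name `approx_posterior` is hypothetical.

```python
import math

def approx_posterior(ml_estimate, se, prior_sd, prior_mean=0.0):
    """Normal-normal approximation: precision-weighted posterior mean and SD."""
    like_prec = 1.0 / se ** 2          # precision of the (approximate) likelihood
    prior_prec = 1.0 / prior_sd ** 2   # precision of the normal prior
    post_var = 1.0 / (like_prec + prior_prec)
    post_mean = post_var * (like_prec * ml_estimate + prior_prec * prior_mean)
    return post_mean, math.sqrt(post_var)

# Size effect for the invertebrate/fish logit: ML estimate 1.46, SE 0.40 (Table 8.13).
diffuse = approx_posterior(1.46, 0.40, prior_sd=100.0)
informative = approx_posterior(1.46, 0.40, prior_sd=1.0)
print(diffuse, informative)
```

Under this approximation the informative prior gives a posterior mean of about 1.26 with standard deviation about 0.37, close to the values in Table 8.13, while the diffuse prior leaves the ML values essentially unchanged.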


NOTES 

Section  8.1:  Nominal  Responses:  Baseline-Category  Logit  Models 

8.1  BCL  models:  Baseline-category  logit  models  were  developed  in  Bock  (1970),  Haberman 
(1974a, pp. 352-373), Mantel (1966), Skrondal and Rabe-Hesketh (2003, 2004, Chap. 13), and
Theil (1969, 1970). Lesaffre and Albert (1989) presented regression diagnostics. Amemiya
(1981), Haberman (1982), and Theil (1970) presented R-squared measures. Baker (1994),
Lang (1996), and Tsodikov and Chefo (2008) showed connections with Poisson models.
Kosmidis and Firth (2011) used this connection in giving a penalized likelihood for bias
reduction. Tutz and Schauberger (2012) proposed graphics for effects in multinomial response
models. 


Section  8.2:  Ordinal  Responses:  Cumulative  Logit  Models 

8.2  Cumulative  logits:  Early  uses  of  cumulative  logit  models  include  Bock  and  Jones  (1968), 
Simon  (1974),  Snell  (1964),  Walker  and  Duncan  (1967),  and  Williams  and  Grizzle  (1972). 
McCullagh  (1980)  popularized  the  proportional  odds  case.  Later  articles  include  Agresti  and 
Lang (1993), Hastie and Tibshirani (1987), Peterson and Harrell (1990), and Tutz (1989). See
also  Note  12.2  and  Sections  12.2.3  and  13.4.1. 

8.3 Score test, power, efficiency: For $2 \times J$ tables and the model $\mathrm{logit}[P(Y \le j)] = \alpha_j + \beta x$,
with $x$ an indicator, McCullagh (1980) noted that the score test of $H_0: \beta = 0$ is equivalent to
a discrete version of the Wilcoxon-Mann-Whitney test. Whitehead (1993) gave sample size
formulas for this case. The sample size $n_J$ needed for a certain power decreases as $J$ increases:
When response categories have equal probabilities, $n_J \approx 0.75 n_2/(1 - 1/J^2)$. The efficiency
loss is major in collapsing to $J = 2$. See also Rabbee et al. (2003). Natarajan et al. (2012)
extended the score test to complex sample survey data. Edwardes (1997) innovatively adapted
the test by treating the cutpoints as random. Rice et al. (2012) discussed ways of dealing with
variation in cutpoints.
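
Reading the approximation in Note 8.3 as $n_J \approx 0.75 n_2/(1 - 1/J^2)$, the relative sample size $n_J/n_2$ can be tabulated for a few values of $J$. This is a small sketch of that arithmetic only; `relative_sample_size` is a hypothetical helper name.

```python
def relative_sample_size(J):
    """Ratio n_J / n_2 from the approximation n_J = 0.75 * n_2 / (1 - 1/J**2),
    which assumes equally probable response categories."""
    return 0.75 / (1.0 - 1.0 / J ** 2)

ratios = {J: relative_sample_size(J) for J in (2, 3, 4, 5, 10)}
print(ratios)
```

The ratio equals 1 at $J = 2$ and falls toward 0.75 as $J$ grows, so under this approximation a binary collapsing needs about one-third more observations than a response with many categories.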

8.4 ROC curve: As a way of evaluating diagnostic tests that have $J > 2$ ordered response
categories rather than (positive, negative), an ROC curve can refer to the various possible
cutoffs for defining a result to be positive. It plots sensitivity against 1 − specificity for the
possible collapsings of the $J$ categories to a (positive, negative) scale (Toledano and Gatsonis
1996).
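
The construction in Note 8.4 can be sketched as follows: each cutoff collapses the ordered categories to (positive, negative) and contributes one point to the curve. The counts below are hypothetical and serve only to illustrate the calculation; `roc_points` is not a function from any particular library.

```python
def roc_points(diseased, healthy):
    """ROC points from counts over J ordered categories (low to high suspicion).

    A result is called positive when the category index is >= the cutoff.
    Returns (1 - specificity, sensitivity) pairs, one per possible cutoff,
    including the two trivial cutoffs.
    """
    n_d, n_h = sum(diseased), sum(healthy)
    points = []
    for cut in range(len(diseased), -1, -1):
        sens = sum(diseased[cut:]) / n_d          # true positive rate
        false_pos = sum(healthy[cut:]) / n_h      # 1 - specificity
        points.append((false_pos, sens))
    return points

# Hypothetical counts for J = 4 ordered test results (illustrative only).
points = roc_points(diseased=[2, 5, 13, 30], healthy=[40, 25, 10, 5])
print(points)
```

With $J$ categories there are $J - 1$ nontrivial cutoffs, so the empirical curve joins $J + 1$ points running from (0, 0) to (1, 1).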


Section  8.3:  Ordinal  Responses:  Alternative  Models 

8.5  Probit,  generalized  links:  Cumulative  probit  models  were  proposed  by  Aitchison  and  Silvey 
(1957)  for  the  one-way  layout  setting  and  Gurland  et  al.  (1960)  and  Bock  and  Jones  (1968, 
Chap. 8) in a general regression setting. McKelvey and Zavoina (1975) presented the underlying
latent normal model. Genter and Farewell (1985) introduced a generalized link function
that permits comparison of fits provided by probit, complementary log-log, and other links.
Adjacent-categories logit models and models equivalent to them were presented by Goodman
(1979a, 1983), Haberman (1974b), and Simon (1974). Greene and Hensher (2010) presented
other  ordinal  modeling  strategies. 

8.6  Hazard/survival:  The  ratio  of  a  pdf  to  the  complement  of  the  cdf  is  the  hazard  function 
(Exercise  4.20).  For  discrete  variables,  this  is  the  ratio  found  in  continuation-ratio  logits. 
The model $\log[-\log(1 - \omega_j(x))] = \alpha_j + \beta^T x$ is a discrete-time version of the proportional
hazards model (Allison 1982, Aranda-Ordaz 1983, Prentice and Gloeckler 1978, Thompson
1977). Läärä and Matthews (1985) showed this is equivalent to the model using the same link
for  cumulative  probabilities. 
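
For a discrete response, the hazard in Note 8.6 is the probability of category $j$ divided by the probability of category $j$ or higher, which is the ratio that continuation-ratio logits use. The sketch below computes these ratios for a hypothetical set of response probabilities; the function name and the numbers are illustrative assumptions.

```python
def discrete_hazards(probs):
    """Discrete-time hazards omega_j = P(Y = j) / P(Y >= j) for j = 1, ..., J-1."""
    hazards = []
    remaining = 1.0  # P(Y >= j), starting at j = 1
    for p in probs[:-1]:
        hazards.append(p / remaining)
        remaining -= p
    return hazards

# Hypothetical response probabilities for J = 4 categories.
hazards = discrete_hazards([0.4, 0.3, 0.2, 0.1])
print(hazards)
```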

8.7  OLS  fitting:  Assigning  scores  to  ordered  response  categories  and  using  ordinary  least-squares 
regression  modeling  is  not  optimal,  because  the  observations  do  not  have  constant  variance. 
Instead, treating the response as multinomial, with categorical predictors, Bhapkar (1968),
Grizzle et al. (1969), and Williams and Grizzle (1972) used weighted least squares, and Haber
(1985) and Lipsitz (1992) used ML. For large $J$, such models approximate a regression model
for continuous $Y$. A structural difficulty is that the model can have predicted means outside the
range of assigned scores. Also, “floor effects” and “ceiling effects” can occur when a latent
response is categorized and a linear model is fitted to the observed response. See Agresti
(2010, Sec. 1.3, 5.6) for details.

8.8  Dispersion  effects:  McCullagh  (1980)  generalized  the  cumulative  link  model  to  incorporate 
dispersion  effects.  With  link  function  g ,  the  model  is 


$$g[P(Y \le j)] = \frac{\alpha_j - \beta^T x}{\exp(\gamma^T x)}.$$

The denominator contains scale parameters $\gamma$ that describe how the dispersion depends on
$x$. This model arises from a latent variable model in which the distribution of $Y^*$ has shape
reflected by $g$, such as normal for the probit link. The latent variable has $E(Y^*) = \beta^T x$ and
standard deviation $\exp(\gamma^T x)$ that varies as $x$ does. See also Agresti (2010, Sec. 5.4) and Cox
(1995). Hamada and Wu (1990) and Nair (1987) presented alternative models for detecting
dispersion  effects. 
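
The latent variable argument in Note 8.8 can be written out in one line. This is a sketch under two assumptions: that $Y \le j$ exactly when $Y^* \le \alpha_j$, and that $Y^* = \beta^T x + \exp(\gamma^T x)\,\epsilon$, where $\epsilon$ has cdf $G = g^{-1}$.

```latex
\begin{align*}
P(Y \le j) &= P(Y^* \le \alpha_j)
  = P\!\left(\epsilon \le \frac{\alpha_j - \beta^T x}{\exp(\gamma^T x)}\right)
  = G\!\left(\frac{\alpha_j - \beta^T x}{\exp(\gamma^T x)}\right), \\
g[P(Y \le j)] &= \frac{\alpha_j - \beta^T x}{\exp(\gamma^T x)}, \qquad g = G^{-1}.
\end{align*}
```

Setting $\gamma = 0$ recovers the ordinary cumulative link model with constant scale.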


Section  8.4:  Testing  Conditional  Independence  in  IxJxK  Tables 

8.9  Generalized  CMH:  Birch  (1965),  Landis  et  al.  (1978),  Mantel  (1963),  and  Mantel  and  Byar 
(1978) generalized the CMH statistic. Let the Kronecker product $B_k = u_k \otimes v_k$ denote a
matrix of constants based on row scores $u_k$ and column scores $v_k$ for stratum $k$. The Landis
et al. (1978) generalized statistic is

$$L^2 = \Big[\sum_k B_k(n_k - \mu_k)\Big]^T \Big[\sum_k B_k V_k B_k^T\Big]^{-1} \Big[\sum_k B_k(n_k - \mu_k)\Big].$$

When $u_k = (u_1, \ldots, u_I)$ and $v_k = (v_1, \ldots, v_J)$ for all strata, $L^2 = M^2$ in (8.19). When $u_k$ is
an $(I - 1) \times I$ matrix $(\mathbf{I}, -\mathbf{1})$, where $\mathbf{I}$ is an identity matrix of size $(I - 1)$ and $\mathbf{1}$ denotes
a column vector of $I - 1$ ones, and $v_k$ is the analogous matrix of size $(J - 1) \times J$, $L^2$
simplifies to (8.18) with df $= (I - 1)(J - 1)$. With this $u_k$ and $v_k = (v_1, \ldots, v_J)$, $L^2$ sums
over the strata information about how $I$ row means compare to their null expected values,
and it has df $= I - 1$. Rank score versions are analogs for ordered categorical responses
of stratum-adjusted Spearman correlation and Kruskal-Wallis tests. Kawaguchi et al. (2011)
extended the Mantel-Haenszel odds ratio estimate to stratified Mann-Whitney estimators that
utilize probability comparisons of two groups [related to A in (2.15)]. Landis et al. (2005) and
Stokes  et  al.  (2012)  reviewed  CMH  methods.  Koch  et  al.  (1982)  reviewed  related  methods. 

8.10  Small-sample  tests  of  conditional  independence:  To  eliminate  nuisance  parameters,  small- 
sample  tests  condition  on  row  and  column  totals  in  each  partial  table.  Section  7.3.5  showed 
this for $2 \times 2 \times K$ tables. When $I > 2$ and/or $J > 2$, the conditional distribution of cell counts
in  each  stratum  is  the  multivariate  hypergeometric  (Section  16.5.1),  and  this  propagates  an 
exact  conditional  distribution  for  the  test  statistic  of  interest,  such  as  a  generalized  CMH 
statistic  (Kim  and  Agresti  1997). 


Section  8.5:  Discrete-Choice  Models 

8.11  McFadden/Bradley-Terry/Luce:  McFadden's  model  relates  to  models  proposed  by  Bradley 
and Terry (1952) (see Section 11.6) and Luce (1959). Train’s (2009) overview text includes
many generalized models, and pages 45-50 discuss the independence from irrelevant alternatives
assumption and references articles dealing with testing whether that property holds. One
approach uses standard tests to compare it to a more complex nested logit model mentioned
in  Section  8.5.5. 


Section  8.6:  Bayesian  Modeling  of  Multinomial  Responses 

8.12  Bayes  multinomial:  For  other  discussion  of  utilizing  the  connection  with  an  underlying 
latent  variable  model,  see  Hoff  (2009,  Sec.  12.1)  and  Johnson  and  Albert  (1999,  Chap.  4). 
See also Congdon (2005, Chap. 7), and many references in Agresti (2010, Chap. 11). For
comparing  two  ordinal  categorical  distributions,  Altham  (1969)  provided  a  Bayesian  estimate 
of  the  probability  that  one  distribution  is  stochastically  higher  than  the  other.  For  Bayesian 
inference  with  baseline-category  logit  models,  see  Congdon  (2005,  Chap.  6),  Daniels  and 
Gatsonis  (1997),  Holmes  and  Held  (2006),  Leonard  and  Hsu  (1994),  and  Sha  et  al.  (2004). 


EXERCISES 

Applications 

8.1 For Table 8.14, let $Y$ = belief in existence of heaven, $x_1$ = gender (1 =
females, 0 = males), and $x_2$ = race (1 = blacks, 0 = whites). Table 8.15 shows
the fit of the model

$$\log(\pi_j/\pi_3) = \alpha_j + \beta_j^G x_1 + \beta_j^R x_2, \quad j = 1, 2,$$

with SE values in parentheses.

a. Find the prediction equation for $\log(\hat\pi_1/\hat\pi_2)$.

b.  Using  the  yes  and  no  response  categories,  interpret  the  conditional  gender  effect 
using  a  95%  confidence  interval  for  an  odds  ratio. 


Table 8.14  Data on Belief in Existence of Heaven for Exercise 8.1

                      Belief in Heaven
Race     Gender     Yes    Unsure    No
Black    Female      88      16       2
         Male        54       7       5
White    Female     397     141      24
         Male       235     189      39

Source: 2008 General Social Survey.

Table 8.15  Fit of Model for Belief in Heaven for Exercise 8.1

               Belief Categories for Logit
Parameter      Yes/No            Unsure/No
Intercept      1.785 (0.168)     1.554 (0.172)
Gender         1.044 (0.259)     0.254 (0.269)
Race           0.703 (0.411)    -0.106 (0.438)

c. Find $\hat\pi_1 = \hat{P}(Y = \text{yes})$ for white females.

d. Without calculating estimated probabilities, explain why the intercept estimates
indicate that for white males, $\hat\pi_1 > \hat\pi_2 > \hat\pi_3$. Use the intercept and gender estimates
to show that the same ordering applies for black females.

e. Without calculating estimated probabilities, explain why the estimates in the
gender row indicate that $\hat\pi_1$ is higher for females than for males, for each race.

f. For this fit, $G^2 = 0.69$. Explain why residual df = 2. Deleting the gender effect,
$G^2 = 47.64$. Conduct a likelihood-ratio test of whether opinion is independent
of gender, given race. Interpret.

8.2 A model fit predicting preference for U.S. President (Democrat, Republican, Independent)
using $x$ = annual income (in \$10,000) is $\log(\hat\pi_D/\hat\pi_I) = 3.3 - 0.2x$ and
$\log(\hat\pi_R/\hat\pi_I) = 1.0 + 0.3x$.

a. Find the prediction equation for $\log(\hat\pi_R/\hat\pi_D)$ and interpret the slope. For what
range of $x$ is $\hat\pi_R > \hat\pi_D$?

b. Find the prediction equation for $\hat\pi_I$.

c. Plot $\hat\pi_D$, $\hat\pi_I$, and $\hat\pi_R$ for $x$ between 0 and 10, and interpret.

8.3 Table 8.16 shows recent GSS data for the effect of gender and race on political party
identification.  Find  a  baseline-category  logit  model  that  fits  well.  Interpret  estimated 
effects  on  the  odds  that  party  identification  is  Democrat  instead  of  Republican. 


Table 8.16  Data for Exercise 8.3 on Political Party ID

                     Political Party Identification
Gender    Race      Democrat    Republican    Independent
Male      White        132         176            127
          Black         42           6             12
Female    White        172         129            130
          Black         56           4             15

8.4  For  63  alligators  caught  in  Lake  George,  Florida,  Table  8.17  classifies  primary  food 
choice  as  (fish,  invertebrate,  other)  and  shows  length  in  meters.  Alligators  are  called 
subadults if length < 1.83 meters (6 feet) and adults if length > 1.83 meters.

a.  Measuring  length  as  (adult,  subadult),  find  a  model  that  adequately  describes 
effects  of  gender  and  length  on  food  choice.  Interpret  the  effects.  For  adult 
females,  find  the  estimated  probabilities  of  the  food  choice  categories. 

b.  Using  only  observations  for  which  primary  food  choice  was  fish  or  invertebrate, 
find a model that adequately describes effects of gender and binary length. Compare
parameter estimates and standard errors for this separate-fitting approach to
those  obtained  with  simultaneous  fitting,  including  the  other  category. 

c.  Treating  length  as  binary  loses  information.  Adapt  the  model  in  part  (a)  to  use 
the  continuous  length  measurements.  Interpret,  explaining  how  the  estimated 
outcome  probabilities  vary  with  length.  Find  the  estimated  length  at  which  the 
invertebrate  and  other  categories  are  equally  likely. 

Table 8.17  Data for Exercise 8.4 on Alligator Food Choice

Each entry gives length (m) followed by primary food choice.

Males
1.30 I   1.32 F   1.32 F   1.40 F   1.42 I   1.42 F   1.47 I   1.47 F
1.50 I   1.52 I   1.63 I   1.65 O   1.65 O   1.65 I   1.65 F   1.68 F
1.70 I   1.73 O   1.78 F   1.78 O   1.80 F   1.85 F   1.93 I   1.93 F
1.98 I   2.03 F   2.03 F   2.31 F   2.36 F   2.46 F   3.25 O   3.28 O
3.33 F   3.56 F   3.58 F   3.66 F   3.68 O   3.71 F   3.89 F

Females
1.24 I   1.30 I   1.45 I   1.45 O   1.55 I   1.60 I   1.60 I   1.65 F
1.78 I   1.78 O   1.80 I   1.88 I   2.16 F   2.26 F   2.31 F   2.36 F
2.39 F   2.41 F   2.44 F   2.56 O   2.67 F   2.72 I   2.79 F   2.84 F

F, fish; I, invertebrates; O, other.


8.5 Fit the multinomial probit model to the alligator food choice data in Table 8.1 and at
the  text  website,  with  size  and  lake  as  predictors.  Compare  estimates  and  SE  values 
to  those  in  Table  8.4,  and  explain  why  they  are  larger  for  the  multinomial  logit 
model. 

8.6  Fit  the  baseline-category  logit  model  with  main  effects  to  the  data  in  Table  8.5. 
Describe  the  effect  of  the  sample  having  no  blacks  in  the  very  happy  category. 

8.7 For recent GSS data, the cumulative logit model (8.5) with $Y$ = political ideology
(very liberal, slightly liberal, moderate, slightly conservative, very conservative) and
$x$ = party affiliation (1 for the 428 Democrats and 0 for the 407 Republicans) has
$\hat\beta = 0.975$ (SE = 0.129) and $\hat\alpha_1 = -2.469$. Interpret $\hat\beta$. Find the estimated probability
of a very liberal response for each group.

8.8  Table  8.18  is  an  expanded  version  of  a  data  set  analyzed  in  Section  9.4.2.  The 
response categories are (1) not injured, (2) injured but not transported by emergency
medical services, (3) injured and transported by emergency medical services but not
hospitalized, (4) injured and hospitalized but did not die, and (5) injured and died.
Table 8.19 shows output for a model of form (8.5).

a.  Why  are  there  four  intercepts?  Explain  how  they  determine  the  estimated  response 
distribution  for  males  in  urban  areas  wearing  seat  belts. 

b.  Construct  a  confidence  interval  for  the  effect  of  gender,  given  seat-belt  use  and 
location.  Interpret. 

Table 8.18  Data for Exercise 8.8 on Degree of Injury in Auto Accident

                                     Response on Injury Outcome
Gender    Location    Seat Belt       1       2      3      4     5
Female    Urban       No           7,287     175    720     91    10
                      Yes         11,587     126    577     48     8
          Rural       No           3,246      73    710    159    31
                      Yes          6,134      94    564     82    17
Male      Urban       No          10,381     136    566     96    14
                      Yes         10,969      83    259     37     1
          Rural       No           6,123     141    710    188    45
                      Yes          6,693      74    353     74    12

Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine.

Table 8.19  Output for Exercise 8.8 on Auto Accident Injuries

Parameter                           DF    Estimate    Std Error
Intercept1                           1      3.3074      0.0351
Intercept2                           1      3.4818      0.0355
Intercept3                           1      5.3494      0.0470
Intercept4                           1      7.2563      0.0914
gender             female            1     -0.5463      0.0272
gender             male              0      0.0000      0.0000
location           rural             1     -0.6988      0.0424
location           urban             0      0.0000      0.0000
seatbelt           no                1     -0.7602      0.0393
seatbelt           yes               0      0.0000      0.0000
location*seatbelt  rural    no       1     -0.1244      0.0548
location*seatbelt  rural    yes      0      0.0000      0.0000
location*seatbelt  urban    no       0      0.0000      0.0000
location*seatbelt  urban    yes      0      0.0000      0.0000

c.  Find  the  estimated  cumulative  odds  ratio  between  the  response  and  seat-belt  use 
for  those  in  rural  locations  and  for  those  in  urban  locations,  given  gender.  Based 
on  this,  explain  how  the  effect  of  seat-belt  use  varies  by  region,  and  explain  how 
to  interpret  the  interaction  estimate,  —0.1244. 

8.9 In a class project, University of Florida students Shahrzad Farshi and Marty Parks
used GSS data to study the effect of several explanatory variables on liking for rap
music, an ordinal variable with five categories (1 = greatest preference). They found
a good fit with the model $\mathrm{logit}[P(Y \le j)] = \alpha_j - 1.06r - 0.58a$, where race $r$ was
coded 1 for white and 0 for black/other and age $a$ has scores (1, 2, 3, 4) for four
successive age categories. Interpret these effects with cumulative odds ratios.

8.10  Table  8.20  refers  to  a  clinical  trial  for  the  treatment  of  small-cell  lung  cancer. 
Patients  were  randomly  assigned  to  two  treatment  groups.  The  sequential  therapy 
administered  the  same  combination  of  chemotherapeutic  agents  in  each  treatment 
cycle; the alternating therapy had three different combinations, alternating from
cycle to cycle.

Table 8.20  Data for Exercise 8.10 on Lung Cancer Clinical Trial

                              Response to Chemotherapy
                         Progressive    No        Partial      Complete
Therapy        Gender    Disease        Change    Remission    Remission
Sequential     Male          28           45         29           26
               Female         4           12          5            2
Alternating    Male          41           44         20           20
               Female        12            7          3            1

Source: W. Holtbrugge and M. Schumacher, Appl. Statist. 40: 249-259, 1991.

a.  Fit  a  cumulative  logit  model  with  main  effects  for  therapy  and  gender.  Interpret 
effect  estimates. 

b. For the therapy effect, compare $\hat\beta_1$ to the estimate obtained when the model
is  fitted  to  the  binary  response  obtained  by  combining  the  first  two  response 
categories  and  combining  the  last  two  response  categories.  What  property  of  the 
model  does  this  reflect? 

c. For the collapsing in (b), compare $\hat\beta_1/SE$ to the ratio obtained for the uncollapsed
response.  (Usually,  a  disadvantage  of  collapsing  ordinal  responses  is  that  the 
significance  of  effects  diminishes.) 

d.  Fit  the  model  to  the  uncollapsed  data  that  also  contains  an  interaction  term. 
Interpret.  Does  it  fit  better?  Explain  why  it  is  equivalent  to  using  the  four 
gender-therapy  combinations  as  levels  of  a  single  factor. 

8.11  A  study  of  factors  affecting  alcohol  consumption  measures  the  response  variable 
with  the  scale  (abstinence,  a  drink  a  day  or  less,  more  than  one  drink  a  day).  For 
a  comparison  of  two  groups  while  adjusting  for  relevant  covariates,  the  researchers 
hypothesize  that  the  two  groups  will  have  about  the  same  prevalence  of  abstinence, 
but  that  one  group  will  have  a  considerably  higher  proportion  who  have  more  than  one 
drink  a  day.  Even  though  the  response  variable  is  ordinal,  explain  why  a  cumulative 
logit  model  with  proportional  odds  structure  may  be  inadequate  for  this  study. 

8.12  Refer  to  Table  8.14.  Treating  belief  in  heaven  as  ordinal,  fit  and  interpret  a  (a) 
cumulative  logit  model  and  (b)  cumulative  probit  model.  Compare  results  and  state 
interpretations  in  each  case. 

8.13  For  the  cumulative  probit  model  fitted  to  Table  8.5,  find  the  means  and  standard 
deviation for the two normal cdf’s that provide the curves for $P(Y = 1)$ as a function
of $x_1$ = number of traumatic events, at the two levels of $x_2$ = race. Interpret the
effects. 

8.14 For Table 8.5, fit and interpret effects for a (a) cumulative link model with complementary
log-log link and (b) continuation-ratio logit model.

8.15 Refer to Exercise 8.7. With adjacent-categories logit model (8.10), $\hat\beta = 0.435$. Interpret
using odds ratios for adjacent categories and for the (very liberal, very
conservative)  pair  of  categories. 

8.16  For  the  developmental  toxicity  data  in  Table  8.7,  formulate  and  fit  a  continuation-ratio 
logit  model  with  proportional  odds  structure.  [Hint:  Create  a  data  file  of  independent 
binomials  and  then  construct  a  model  matrix  that  has  the  desired  model  structure.] 

8.17  Table  8.21  refers  to  a  study  that  randomly  assigned  subjects  to  a  control  or  treatment 
group.  Daily  during  the  study,  treatment  subjects  ate  cereal  containing  psyllium.  The 
study  analyzed  the  effect  on  LDL  cholesterol. 

a.  Model  the  ending  cholesterol  level  as  a  function  of  treatment,  using  the  beginning 
level  as  a  covariate.  Interpret  the  treatment  effect. 

b.  Repeat  part  (a),  now  treating  the  beginning  level  as  qualitative.  Compare  results. 


Table 8.21  Data for Exercise 8.17 on Cholesterol and Cereal

                            Ending LDL Cholesterol Level
                      Control                              Treatment
Beginning    ≤ 3.4   3.4-4.1   4.1-4.9   > 4.9    ≤ 3.4   3.4-4.1   4.1-4.9   > 4.9
≤ 3.4          18       8         0        0        21       4         2        0
3.4-4.1        16      30        13        2        17      25         6        0
4.1-4.9         0      14        28        7        11      35        36        6
> 4.9           0       2        15       22         1       5        14       12

Source: Data courtesy of Sallee Anderson, Kellogg Co.


8.18 The book’s website (www.stat.ufl.edu/~aa/cda/cda.html) has a 3 × 4 × 4
table that cross-classifies dumping severity ($Y$) and operation ($X$) for four hospitals
(H).  The  four  operations  refer  to  treatments  for  duodenal  ulcer  patients  and  have 
a  natural  ordering.  Dumping  severity  describes  a  possible  undesirable  side  effect 
of  the  operation.  Its  three  categories  are  also  ordered.  Table  8.22  shows  results  of 
generalized  CMH  tests.  For  each  test,  give  a  pair  of  models  such  that  a  likelihood- 
ratio  test  comparing  those  models  would  give  similar  results.  Explain  how  one  test 
can  be  much  more  significant  than  the  others. 


Table 8.22  Results for Dumping Severity Data of Exercise 8.18

Summary Statistics for dumping by operate
Controlling for hospital

Statistic    Alternative Hypothesis     DF      Value      Prob
    1        Nonzero Correlation         1     6.3404     0.0118
    2        Row Mean Scores Differ      3     6.5901     0.0862
    3        General Association         6    10.5983     0.1016

8.19 A sample of subjects indicate their favorite among four Margarita pizzas characterized
by 1 = (thin crust, normal cheese), 2 = (thin crust, extra cheese), 3 = (thick
crust, normal cheese), 4 = (thick crust, extra cheese). For the characteristics of the
choices $x_1$ = crust type (1 = thick, 0 = thin) and $x_2$ = cheese quantity (1 = extra,
0 = normal), the multinomial discrete choice model (8.20) has $\hat\beta_1 = -0.40$ and
$\hat\beta_2 = 0.60$. For each pizza type, find the probability that it is the favorite.

8.20  Refer  to  the  previous  exercise.  For  a  random  sample  of  20  pizza  lovers,  suppose  4 
prefer  choice  1,  8  prefer  choice  2,  3  prefer  choice  3,  and  5  prefer  choice  4.  Fit  the 
model  and  interpret  the  estimates. 

8.21  Describe  an  application  in  which  a  discrete-choice  model  would  be  useful.  Specify 
potential  explanatory  variables,  and  identify  which  are  characteristics  of  the  chooser 
and  which  are  characteristics  of  the  choices. 

8.22  A  cafe  has  four  entrees:  chicken,  beef,  fish,  vegetarian.  Specify  a  model  of  form 
(8.20)  for  the  selection  of  an  entree  using  x  =  gender  (1  =  female,  0  =  male)  and 
w  =  cost  of  entree,  which  is  a  characteristic  of  the  choices.  Interpret  the  model 
parameters. 

8.23 For Table 8.14 on belief in heaven, use Bayesian methods to fit the model of Exercise
8.1. Do this once with uninformative priors (say, $\sigma = 100$) and once with very
informative priors (say, $\sigma = 1$). In each case, for the gender effect on the (yes/no)
logit,  report  the  posterior  mean  and  standard  deviation  and  the  95%  posterior  interval. 
Compare  results  between  them  and  with  the  ML  estimate,  SE,  and  95%  confidence 
interval. 

8.24  In  the  previous  exercise,  treat  belief  in  heaven  as  ordinal  and  reanalyze  with  Bayesian 
methods.  Compare  results  for  the  gender  effect,  and  interpret. 

8.25  Consider  the  baseline-category  logit  model  of  Section  8.6.4  for  Bayesian  modeling 
of alligator food choice in terms of size and lake. Try to replicate results in Table 8.13
for $\sigma = 1$. (If your results differ much, for your parameterization the 10 conditional
log odds ratios relating size to pairs of food choices may not all have prior $\sigma = 1$.)

8.26  Is  political  ideology  associated  with  happiness?  Conduct  a  Bayesian  analysis  for  the 
data  in  Table  3.7,  using  a  model  presented  in  this  chapter.  Present  a  posterior  interval 
and  posterior  probability  that  addresses  the  question,  and  interpret  results. 

8.27  Analyze  Table  8.5  with  two  types  of  model  studied  in  this  chapter.  Write  a  report 
summarizing  results  and  advantages  and  disadvantages  of  each  modeling  strategy. 

8.28  This book’s website has a $4 \times 2 \times 3 \times 3$ table that cross-classifies a sample of
residents of Copenhagen on type of housing (H), degree of contact with other
residents (C), feeling of influence on apartment management (I), and satisfaction
with housing conditions (S). Treating S as the response variable, analyze these data.




Theory  and  Methods 

8.29  A multivariate generalization of the exponential dispersion family (4.17) is

$$f(\mathbf{y}_i; \boldsymbol{\theta}_i, \phi) = \exp\{[\mathbf{y}_i^T \boldsymbol{\theta}_i - b(\boldsymbol{\theta}_i)]/a(\phi) + c(\mathbf{y}_i, \phi)\},$$

where $\boldsymbol{\theta}_i$ is the natural parameter. Show that a multinomial variate $\mathbf{y}_i$ for a single
trial with parameters $\{\pi_j, j = 1, \ldots, J - 1\}$ is in the $(J - 1)$-parameter exponential
family, with baseline-category logits as natural parameters.

8.30  Cell counts $\{y_{ij}\}$ in an $I \times J$ contingency table have a multinomial $(n; \{\pi_{ij}\})$
distribution. Show that $\{P(Y_{ij} = n_{ij})\}$ can be expressed as

$$d^n\, n! \Big(\prod_i \prod_j n_{ij}!\Big)^{-1} \exp\Big[\sum_{i=1}^{I-1} \sum_{j=1}^{J-1} n_{ij} \log(\alpha_{ij}) + \sum_{i=1}^{I-1} n_{i+} \log(\pi_{iJ}/\pi_{IJ}) + \sum_{j=1}^{J-1} n_{+j} \log(\pi_{Ij}/\pi_{IJ})\Big],$$

where $\alpha_{ij} = \pi_{ij}\pi_{IJ}/\pi_{iJ}\pi_{Ij}$ and $d$ is a constant independent of the data. Find an
alternative expression using local odds ratios $\{\theta_{ij}\}$, by showing that

$$\sum_i \sum_j n_{ij} \log \alpha_{ij} = \sum_i \sum_j s_{ij} \log \theta_{ij}, \quad \text{where } s_{ij} = \sum_{a \le i} \sum_{b \le j} n_{ab}.$$

(Hence, models for such parameters have reduced sufficient statistics and relatively
simple score statistics for testing effects.)


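8.31  Consider the baseline-category logit model expressed as

$$\pi_j(\mathbf{x}) = \frac{\exp(\alpha_j + \boldsymbol{\beta}_j^T \mathbf{x})}{\sum_{h=1}^{J} \exp(\alpha_h + \boldsymbol{\beta}_h^T \mathbf{x})}.$$

Show that dividing numerator and denominator by $\exp(\alpha_J + \boldsymbol{\beta}_J^T \mathbf{x})$ yields new
parameters $\alpha_j^* = \alpha_j - \alpha_J$ and $\boldsymbol{\beta}_j^* = \boldsymbol{\beta}_j - \boldsymbol{\beta}_J$ that satisfy $\alpha_J^* = 0$ and $\boldsymbol{\beta}_J^* = \mathbf{0}$. Thus,
without loss of generality, we can take $\alpha_J = 0$ and $\boldsymbol{\beta}_J = \mathbf{0}$.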


8.32  When there are $J = 3$ outcome categories, suppose that

$$\pi_j(x) = \exp(\alpha_j + \beta_j x)/[1 + \exp(\alpha_1 + \beta_1 x) + \exp(\alpha_2 + \beta_2 x)],$$

$j = 1, 2$. Show that $\pi_3(x)$ is (a) decreasing in $x$ if $\beta_1 > 0$ and $\beta_2 > 0$, (b) increasing
in $x$ if $\beta_1 < 0$ and $\beta_2 < 0$, and (c) nonmonotone when $\beta_1$ and $\beta_2$ have different signs.

8.33  Refer to the log-likelihood function for the baseline-category logit model (Section
8.1.4). Denote the sufficient statistics by $np_j = \sum_i y_{ij}$ and $S_{jk} = \sum_i x_{ik} y_{ij}$,
$j = 1, \ldots, J$, $k = 1, \ldots, p$. Let $\mathbf{S} = (S_{11}, \ldots, S_{1p}, \ldots, S_{J1}, \ldots, S_{Jp})^T$. Under the
null hypothesis that explanatory variables have no effect, conditional on $\{p_j,
j = 1, \ldots, J\}$, show that

$$E(\mathbf{S}) = n(\mathbf{p} \otimes \mathbf{m}), \quad \text{var}(\mathbf{S}) = n(\mathbf{V} \otimes \boldsymbol{\Sigma}),$$

where $\mathbf{p} = (p_1, \ldots, p_J)^T$, $\mathbf{m} = (\bar{x}_1, \ldots, \bar{x}_p)^T$ with $\bar{x}_k = (\sum_i x_{ik})/n$, $\boldsymbol{\Sigma}$ has elements
$s_{kv} = [\sum_i (x_{ik} - \bar{x}_k)(x_{iv} - \bar{x}_v)]/(n - 1)$, and $\mathbf{V}$ has elements $v_{ii} = p_i(1 - p_i)$
and $v_{ij} = -p_i p_j$ (Zelen 1991).


8.34  An alternative fitting approach for the baseline-category logit model (8.1) fits binary
logistic models separately for the $J - 1$ pairings of responses. The estimates have
larger SE than the ML estimates for simultaneous fitting of the $J - 1$ logits, but
Begg and Gray (1984) showed that the efficiency loss is minor when the response
category having highest prevalence is the baseline. Illustrate, by showing that the
fit using categories I and F alone of the alligator data is $\log(\hat{\pi}_I/\hat{\pi}_F) = -1.69 +
1.66s - 1.78z_H + 1.05z_O + 1.22z_T$, with SE values (0.43, 0.62, 0.49, 0.52) for the
effects. Compare with the first row of Table 8.4.

8.35  For explanatory variable $k$ in a baseline-category logit model, suppose the model
matrix constrains $\beta_{2k} = \cdots = \beta_{Jk} = 0$, leaving $\beta_{1k}$ unconstrained. Explain how $\beta_{1k}$
then describes a contrast for that variable between outcome category 1 and the other
categories combined. Explain how to generalize this to contrast one subset of the
categories to the other categories.

8.36  Explain  why  the  cumulative  logit  model  of  proportional  odds  form  is  not  a  special 
case  of  a  baseline-category  logit  model. 

8.37  Consider the cumulative logit model, $\text{logit}[P(Y \le j)] = \alpha_j + \beta_j x$, not having
proportional odds form.

a.  With continuous $x$ taking values over the real line, show that the model is improper
in that cumulative probabilities are misordered for a range of $x$ values.

b.  When $x$ is a binary indicator, explain why the model is proper but requires
constraints on $\{\alpha_j + \beta_j\}$ (as well as the usual ordering constraint on $\{\alpha_j\}$) and is
then equivalent to the saturated model.

8.38  For an $I \times J$ contingency table with ordinal $Y$ and scores $\{x_i = i\}$, consider the
model

$$\text{logit}[P(Y \le j \mid X = x_i)] = \alpha_j + \beta x_i. \qquad (8.23)$$

a.  Show that $\text{logit}[P(Y \le j \mid X = x_{i+1})] - \text{logit}[P(Y \le j \mid X = x_i)] = \beta$ is a log
cumulative odds ratio for the $2 \times 2$ table consisting of rows $i$ and $i + 1$ and the
binary response having cutpoint following category $j$. Thus, (8.23) is a uniform
association model in cumulative odds ratios.

b.  Show that (i) residual df $= IJ - I - J$ and (ii) $\beta = 0$ corresponds to independence
of $X$ and $Y$.

c.  Using the same linear predictor but with adjacent-categories logits, show that
uniform association applies to the local odds ratios (2.10).

8.39  A cumulative link model for an $I \times J$ table with a qualitative predictor is

$$G^{-1}[P(Y \le j)] = \alpha_j + \mu_i, \quad i = 1, \ldots, I, \quad j = 1, \ldots, J - 1.$$

Show that (a) residual df $= (I - 1)(J - 2)$, (b) independence corresponds to
$\mu_1 = \cdots = \mu_I$, (c) the test of independence has df $= I - 1$, and (d) the rows
are stochastically ordered on $Y$.

8.40  Prove factorization (8.14) for the multinomial distribution.

8.41  A  response  scale  has  the  categories  (strongly  agree,  mildly  agree,  mildly  disagree, 
strongly  disagree,  don’t  know).  One  model  uses  a  logistic  part  for  P(don’t  know) 
and  a  separate  ordinal  part  for  the  ordered  categories  conditional  on  response  in 
one  of  those  categories.  Explain  how  to  construct  a  likelihood  function  to  do  this 
simultaneously. 

8.42  For cumulative link model (8.7), show that for $1 \le j < k \le J - 1$, $P(Y \le k \mid \mathbf{x}) =
P(Y \le j \mid \mathbf{x}^*)$, where $\mathbf{x}^*$ is obtained by increasing the $i$th component of $\mathbf{x}$ by
$(\alpha_k - \alpha_j)/\beta_i$. Interpret.

8.43  When $X$ and $Y$ are ordinal, explain how to test conditional independence by allowing
a different trend in each partial table. [Hint: Generalize model (8.17) by replacing $\beta$
by $\beta_k$.]

8.44  Consider equation (8.21) and the condition of independence from irrelevant
alternatives. Explain why this condition does not hold for the multinomial probit model.

8.45  For a Bayesian analysis, explain why the posterior $P(\beta < 0)$ is analogous to the
frequentist P-value for $H_a$: $\beta > 0$.

8.46  After  fitting  a  cumulative  logit  model  of  proportional  odds  form,  what  might  you  do 
to  check  the  model  (a)  as  part  of  a  frequentist  analysis  and  (b)  as  part  of  a  Bayesian 
analysis? 


CHAPTER  9 


Loglinear  Models  for  Contingency  Tables 


In  Section  4.3  we  introduced  loglinear  models  as  generalized  linear  models  (GLMs)  using 
the  log  link  function  with  a  Poisson  response.  A  common  use  is  modeling  for  contingency 
tables.  The  models  specify  the  joint  distribution  among  the  categorical  variables  that  are 
cross-classified  to  form  the  table.  They  are  used  to  analyze  association  and  interaction 
patterns  among  those  variables.  The  models  specify  how  the  expected  cell  counts  depend 
on  levels  of  the  categorical  variables  for  that  cell  as  well  as  associations  and  interactions 
among  those  variables. 

We present loglinear models in Section 9.1 for two-way contingency tables, in Sections
9.2 and 9.3 for three-way tables, and in Section 9.4 for multiway tables. Loglinear
models  are  of  use  primarily  when  at  least  two  variables  are  response  variables.  With  a  single 
categorical  response,  it  is  simpler  and  more  natural  to  use  logistic  regression  models.  When 
one  variable  is  treated  as  a  response  and  the  others  as  explanatory  variables,  logistic  models 
for  that  response  variable  are  equivalent  to  certain  loglinear  models.  Section 9.5  presents 
this  connection.  In  Sections 9.6  and  9.7  we  discuss  loglinear  model  fitting. 

9.1  LOGLINEAR  MODELS  FOR  TWO-WAY  TABLES 

Consider an $I \times J$ contingency table that cross-classifies a multinomial sample of $n$ subjects
on two categorical responses. For cell probabilities $\{\pi_{ij}\}$, the expected frequencies are
$\{\mu_{ij} = n\pi_{ij}\}$. Loglinear model formulas use $\{\mu_{ij}\}$ rather than $\{\pi_{ij}\}$, so they also apply with
Poisson sampling for $N = IJ$ independent cell counts $\{Y_{ij}\}$ having $\{\mu_{ij} = E(Y_{ij})\}$. In either
case we denote the observed cell counts by $\{n_{ij}\}$.

9.1.1  Independence  Model  for  a  Two-Way  Table 

Under statistical independence, the $\{\mu_{ij}\}$ have the structure

$$\mu_{ij} = \mu \alpha_i \beta_j$$

(Section 4.3.7). For multinomial sampling, for instance, $\mu_{ij} = n\pi_{i+}\pi_{+j}$. Denote the row
variable by $X$ and the column variable by $Y$. The formula expressing independence is
multiplicative. Thus, $\log \mu_{ij}$ has additive form¹

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y \qquad (9.1)$$

for a row effect $\lambda_i^X$ and a column effect $\lambda_j^Y$. This is the loglinear model of independence.
Identifiability requires constraints such as $\lambda_I^X = \lambda_J^Y = 0$ or $\sum_i \lambda_i^X = \sum_j \lambda_j^Y = 0$.

The ML fitted values are $\{\hat{\mu}_{ij} = n_{i+}n_{+j}/n\}$, the estimated expected frequencies for
chi-squared tests of independence. The tests using $X^2$ and $G^2$ (Section 3.2.1) are goodness-of-fit
tests of this loglinear model.
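
As a numerical illustration of the independence fit, the following Python sketch computes $\{\hat{\mu}_{ij} = n_{i+}n_{+j}/n\}$ together with $X^2$ and $G^2$. The counts and the function name `independence_fit` are hypothetical, chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def independence_fit(counts):
    """Fitted values and fit statistics for the loglinear independence model.

    counts: I x J array of observed cell counts (hypothetical helper).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # n_{i+}
    col = counts.sum(axis=0, keepdims=True)   # n_{+j}
    fitted = row * col / n                    # mu-hat_{ij} = n_{i+} n_{+j} / n
    x2 = ((counts - fitted) ** 2 / fitted).sum()
    # G^2 = 2 * sum n_ij log(n_ij / mu-hat_ij); empty cells contribute zero
    pos = counts > 0
    g2 = 2.0 * (counts[pos] * np.log(counts[pos] / fitted[pos])).sum()
    df = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return fitted, x2, g2, df

# Hypothetical 2 x 3 table, used only to exercise the function
table = np.array([[20, 30, 50],
                  [30, 30, 40]])
fitted, x2, g2, df = independence_fit(table)
print(fitted)
print(round(x2, 3), round(g2, 3), df)
```

The fitted table reproduces the observed row and column totals, which is the defining feature of the ML fit for this model.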

9.1.2  Interpretation  of  Loglinear  Model  Parameters 

Loglinear models for contingency tables are GLMs that treat the $N$ cell counts as independent
observations of a Poisson random component. Loglinear GLMs identify the data as
the $N$ cell counts rather than the individual classifications of the $n$ subjects. The expected
cell counts link to the explanatory terms using the log link. As (9.1) illustrates, of the
cross-classified variables, the model does not distinguish between response and explanatory
variables. It treats both jointly as responses, modeling $\{\mu_{ij}\}$ for combinations of their
levels. To interpret parameters, however, it is helpful to treat the variables asymmetrically.
We illustrate with the independence model for $I \times 2$ tables. In row $i$, the logit equals


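$$\text{logit}[P(Y = 1 \mid X = i)] = \log \frac{P(Y = 1 \mid X = i)}{P(Y = 2 \mid X = i)} = \log \frac{\mu_{i1}}{\mu_{i2}} = \log \mu_{i1} - \log \mu_{i2}$$

$$= (\lambda + \lambda_i^X + \lambda_1^Y) - (\lambda + \lambda_i^X + \lambda_2^Y) = \lambda_1^Y - \lambda_2^Y.$$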


The final term does not depend on $i$; that is, $\text{logit}[P(Y = 1 \mid X = i)]$ is identical at each
level of $X$. Thus, independence implies a model of form $\text{logit}[P(Y = 1 \mid X = i)] = \alpha$. In
each row, the odds of response in column 1 equal $\exp(\alpha) = \exp(\lambda_1^Y - \lambda_2^Y)$. When $\lambda_1^Y = \lambda_2^Y$,
the model simplifies to $\log \mu_{ij} = \lambda + \lambda_i^X$ and there is equiprobability, with $P(Y = 1 \mid X =
i) = P(Y = 2 \mid X = i)$ in each row.

An  analogous  property  holds  when  J  >  2.  Differences  between  two  parameters  for  a 
variable  relate  to  the  log  odds  of  making  one  response,  relative  to  the  other,  on  that  variable. 

9.1.3  Saturated  Model  for  a  Two-Way  Table 

Statistically dependent variables satisfy a more complex loglinear model,

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}. \qquad (9.2)$$

The $\{\lambda_{ij}^{XY}\}$ are association terms that reflect deviations from independence. The right-hand
side of (9.2) resembles the formula for cell means in two-way ANOVA, allowing interaction.
The $\{\lambda_{ij}^{XY}\}$ represent interactions between $X$ and $Y$, whereby the effect of one variable on $\mu_{ij}$
depends on the level of the other. The independence model (9.1) results when all $\lambda_{ij}^{XY} = 0$.

¹ The X and Y superscripts represent the variables and are not exponents.

With constraints $\lambda_I^X = \lambda_J^Y = 0$ in (9.1) and (9.2), $\{\lambda_i^X\}$ and $\{\lambda_j^Y\}$ are, equivalently,
coefficients of indicator variables for the first $(I - 1)$ categories of $X$ and for the first
$(J - 1)$ categories of $Y$. Thus, $\lambda_{ij}^{XY}$ is the coefficient of the product of indicator variables
for $\lambda_i^X$ and $\lambda_j^Y$. Since there are $(I - 1)(J - 1)$ such cross-products, $\lambda_{Ij}^{XY} = \lambda_{iJ}^{XY} = 0$,
and only $(I - 1)(J - 1)$ of these parameters are nonredundant. Tests of independence
analyze whether these $(I - 1)(J - 1)$ parameters equal zero, so they have residual
df $= (I - 1)(J - 1)$.

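The number of parameters in model (9.2) equals $1 + (I - 1) + (J - 1) + (I - 1)(J -
1) = IJ$, the number of cells. Hence, this model describes perfectly any $\{\mu_{ij} > 0\}$ (see
Exercise 9.16). The ML fitted values are $\{\hat{\mu}_{ij} = n_{ij}\}$. It is the most general model for
two-way contingency tables, the saturated model. For it, direct relationships exist between log
odds ratios and $\{\lambda_{ij}^{XY}\}$. For instance, for $2 \times 2$ tables,

$$\log \theta = \log \frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}} = \log \mu_{11} + \log \mu_{22} - \log \mu_{12} - \log \mu_{21}$$

$$= (\lambda + \lambda_1^X + \lambda_1^Y + \lambda_{11}^{XY}) + (\lambda + \lambda_2^X + \lambda_2^Y + \lambda_{22}^{XY}) - (\lambda + \lambda_1^X + \lambda_2^Y + \lambda_{12}^{XY}) - (\lambda + \lambda_2^X + \lambda_1^Y + \lambda_{21}^{XY})$$

$$= \lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}. \qquad (9.3)$$

Thus, $\{\lambda_{ij}^{XY}\}$ determine the association.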

In  practice,  unsaturated  models  are  preferable,  since  their  fit  smooths  the  sample  data 
and  has  simpler  interpretations.  For  tables  with  at  least  three  variables,  unsaturated  models 
can  include  association  terms.  Then,  loglinear  models  are  more  commonly  used  to  describe 
associations  (through  two-factor  terms)  than  to  describe  odds  (through  single-factor  terms). 

9.1.4  Alternative  Parameter  Constraints 

As with the independence model, the parameter constraints for the saturated model are
arbitrary. Instead of setting all $\lambda_{Ij}^{XY} = \lambda_{iJ}^{XY} = 0$, we could set $\sum_i \lambda_{ij}^{XY} = \sum_j \lambda_{ij}^{XY} = 0$ for
all $i$ and $j$. Different software uses different constraints. What is unique and estimable are
contrasts such as $\lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}$ in (9.3) that determine odds ratios.

For instance, suppose that a log odds ratio equals 2.0 in a $2 \times 2$ table. With the first set
of constraints, 2.0 is the coefficient of a product of an indicator variable indicating the first
category of $X$ and an indicator variable indicating the first category of $Y$. With it, $\lambda_{11}^{XY} = 2.0$
and $\lambda_{12}^{XY} = \lambda_{21}^{XY} = \lambda_{22}^{XY} = 0$. For sum-to-zero constraints, $\lambda_{11}^{XY} = \lambda_{22}^{XY} = 0.5$, $\lambda_{12}^{XY} = \lambda_{21}^{XY} =
-0.5$. For either set, the log odds ratio (9.3) equals 2.0.
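
The invariance of the log odds ratio to the choice of constraints can be checked numerically. The sketch below builds $\log \mu_{ij}$ for a $2 \times 2$ table under each constraint set and evaluates contrast (9.3). The intercept and main-effect values are hypothetical and play no role in the result; the helper names `log_mu` and `log_odds_ratio` are ours.

```python
import numpy as np

def log_mu(lam, lam_x, lam_y, lam_xy):
    """log mu_ij = lam + lam_x[i] + lam_y[j] + lam_xy[i, j] (saturated model)."""
    return lam + lam_x[:, None] + lam_y[None, :] + lam_xy

def log_odds_ratio(lm):
    """Contrast (9.3) applied to a 2 x 2 array of log expected frequencies."""
    return lm[0, 0] + lm[1, 1] - lm[0, 1] - lm[1, 0]

# Indicator-variable constraints: parameters at the last level equal zero
xy_last_zero = np.array([[2.0, 0.0],
                         [0.0, 0.0]])
# Sum-to-zero constraints over each index
xy_sum_zero = np.array([[0.5, -0.5],
                        [-0.5, 0.5]])

# Hypothetical intercept and main effects; they cancel in the contrast
lm_a = log_mu(1.0, np.array([0.3, 0.0]), np.array([-0.2, 0.0]), xy_last_zero)
lm_b = log_mu(1.5, np.array([0.1, -0.1]), np.array([0.4, -0.4]), xy_sum_zero)

lor_a = log_odds_ratio(lm_a)
lor_b = log_odds_ratio(lm_b)
print(lor_a, lor_b)
```

Both parameterizations give a log odds ratio of 2.0, although the individual $\lambda_{ij}^{XY}$ values differ.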

9.1.5  Hierarchical  Versus  Nonhierarchical  Models 

Like other models in this book, model (9.2) is hierarchical. This means that the model
includes all lower-order terms composed from variables contained in a higher-order term.
When the model contains $\lambda_{ij}^{XY}$, it also contains $\lambda_i^X$ and $\lambda_j^Y$. A reason for including lower-order
terms is that, otherwise, the statistical significance and the interpretation of a higher-order
term depends on how variables are coded. This is undesirable, and with hierarchical models
the same results occur no matter how variables are coded.

An example of a nonhierarchical model is

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_{ij}^{XY}.$$

This model permits association but forces unnatural behavior of expected frequencies, with
the pattern depending on constraints used for parameters. For instance, with constraints
whereby parameters are zero at the last level, $\log \mu_{Ij} = \lambda$ in every column. Nonhierarchical
models are rarely sensible in practice. Using them is analogous to using ANOVA or
regression models with interaction terms but without the corresponding main effects.

When  a  model  has  two-factor  terms,  interpretations  focus  on  them  rather  than  on  the 
single-factor  terms.  By  analogy  with  two-way  ANOVA  with  two-factor  interaction,  it  can 
be  misleading  to  report  main  effects.  The  estimates  of  the  main-effect  terms  depend  on  the 
coding  scheme  used  for  the  higher-order  effects,  and  the  interpretation  also  depends  on  that 
scheme  (see  Exercise  9.16).  Normally,  we  restrict  our  attention  to  the  highest-order  terms 
for  a  variable,  as  we  illustrate  in  Section  9.2. 


9.1.6  Multinomial  Models  for  Cell  Probabilities 

Conditional on the sum $n$ of the cell counts, Poisson loglinear models for $\{\mu_{ij}\}$ become
multinomial models for cell probabilities $\{\pi_{ij} = \mu_{ij}/(\sum_a \sum_b \mu_{ab})\}$. To illustrate, for the
saturated model,

$$\pi_{ij} = \frac{\exp(\lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY})}{\sum_a \sum_b \exp(\lambda + \lambda_a^X + \lambda_b^Y + \lambda_{ab}^{XY})}. \qquad (9.4)$$

The $\lambda$ intercept parameter cancels in the multinomial model (9.4). This parameter relates
to the total sample size, which is random in the Poisson model but not in the multinomial
model. So, the saturated multinomial model has $IJ - 1$ parameters, representing the usual
constraint for probabilities, $\sum_i \sum_j \pi_{ij} = 1$.
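
The cancellation of the intercept in (9.4) is easy to confirm numerically. In this sketch the parameter values are hypothetical and `cell_probs` is an illustrative helper, not a library function.

```python
import numpy as np

def cell_probs(lam, lam_x, lam_y, lam_xy):
    """Multinomial cell probabilities (9.4) implied by loglinear parameters."""
    mu = np.exp(lam + lam_x[:, None] + lam_y[None, :] + lam_xy)
    return mu / mu.sum()

# Hypothetical parameter values for a 2 x 3 table (last-level-zero constraints)
lam_x = np.array([0.4, 0.0])
lam_y = np.array([-0.3, 0.6, 0.0])
lam_xy = np.array([[0.8, -0.5, 0.0],
                   [0.0, 0.0, 0.0]])

p_small = cell_probs(0.0, lam_x, lam_y, lam_xy)
p_large = cell_probs(5.0, lam_x, lam_y, lam_xy)   # only the intercept changes
print(np.allclose(p_small, p_large), p_small.sum())
```

Changing $\lambda$ rescales every $\mu_{ij}$ by the same factor, so the probabilities are unchanged and sum to 1.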


9.2  LOGLINEAR  MODELS  FOR  INDEPENDENCE  AND 
INTERACTION  IN  THREE-WAY  TABLES 

In  Section  2.3  we  introduced  structure  for  three-way  contingency  tables,  such  as  condi¬ 
tional  independence  and  homogeneous  association.  Loglinear  models  for  three-way  tables 
describe  these  independence  and  association  patterns. 


9.2.1  Types  of  Independence 

A three-way $I \times J \times K$ cross-classification of response variables $X$, $Y$, and $Z$ has several
potential types of independence. The models apply to a multinomial distribution with
cell probabilities $\{\pi_{ijk}\}$ having $\sum_i \sum_j \sum_k \pi_{ijk} = 1.0$ and also to Poisson sampling with
means $\{\mu_{ijk}\}$.

Mutual independence: The three variables are mutually independent when

$$\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k} \quad \text{for all } i, j, \text{ and } k. \qquad (9.5)$$

For expected frequencies $\{\mu_{ijk}\}$, mutual independence has loglinear form

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z. \qquad (9.6)$$

Joint independence: Variable $Y$ is jointly independent of $X$ and $Z$ when

$$\pi_{ijk} = \pi_{i+k}\pi_{+j+} \quad \text{for all } i, j, \text{ and } k. \qquad (9.7)$$

This is ordinary two-way independence between $Y$ and a variable composed of the $IK$
combinations of levels of $X$ and $Z$. The loglinear model is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ}. \qquad (9.8)$$

Similarly, $X$ could be jointly independent of $Y$ and $Z$, or $Z$ could be jointly independent of
$X$ and $Y$. Mutual independence (9.5) implies joint independence of any one variable from
the other two.

Conditional independence: Categorical variables $X$ and $Y$ are conditionally independent,
given $Z$, when independence holds for each partial table within which $Z$ is fixed. That
is, if $\pi_{ij|k} = P(X = i, Y = j \mid Z = k)$, then

$$\pi_{ij|k} = \pi_{i+|k}\pi_{+j|k} \quad \text{for all } i, j, \text{ and } k.$$

For joint probabilities over the entire table, equivalently

$$\pi_{ijk} = \pi_{i+k}\pi_{+jk}/\pi_{++k} \quad \text{for all } i, j, \text{ and } k. \qquad (9.9)$$

Conditional independence of $X$ and $Y$, given $Z$, has loglinear model form

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}. \qquad (9.10)$$

This is a weaker condition than mutual or joint independence. Mutual independence implies
that $Y$ is jointly independent of $X$ and $Z$, which itself implies that $X$ and $Y$ are conditionally
independent. Table 9.1 summarizes these three types of independence.


Table 9.1  Loglinear Independence Models for Three-Dimensional Tables

Model Formula   Probabilistic Form for $\pi_{ijk}$   Association Terms in Loglinear Model      Interpretation
(9.6)           $\pi_{i++}\pi_{+j+}\pi_{++k}$        None                                      Mutual independence of X, Y, Z
(9.8)           $\pi_{i+k}\pi_{+j+}$                 $\lambda_{ik}^{XZ}$                       Joint independence of Y from X and Z
(9.10)          $\pi_{i+k}\pi_{+jk}/\pi_{++k}$       $\lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$   Conditional independence of X and Y, given Z

Figure 9.1  Relationships among types of XY independence.


Recall that conditional associations can be quite different from marginal associations
(Section 2.3.2). For instance, conditional independence does not imply marginal independence.
Conditional independence and marginal independence both hold when one of the
stronger types of independence studied above applies. Figure 9.1 summarizes relationships
among the four types of independence.
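
A small numerical example shows that conditional independence need not carry over to the marginal table. The sketch below constructs hypothetical $2 \times 2 \times 2$ cell probabilities satisfying (9.9) and compares conditional and marginal $XY$ odds ratios. All probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical distributions: P(Z = k), P(X = i | Z = k), P(Y = j | Z = k)
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.8, 0.2],
                        [0.2, 0.8]])   # rows index i, columns index k
p_y_given_z = np.array([[0.8, 0.2],
                        [0.2, 0.8]])   # rows index j, columns index k

# pi_ijk = P(X=i | Z=k) P(Y=j | Z=k) P(Z=k): X and Y independent given Z
pi = np.einsum("ik,jk,k->ijk", p_x_given_z, p_y_given_z, p_z)

def odds_ratio(t):
    """Odds ratio of a 2 x 2 table."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

# Check the joint-probability form (9.9): pi_ijk = pi_{i+k} pi_{+jk} / pi_{++k}
pi_i_k = pi.sum(axis=1)
pi_jk = pi.sum(axis=0)
pi_k = pi.sum(axis=(0, 1))
rhs = pi_i_k[:, None, :] * pi_jk[None, :, :] / pi_k[None, None, :]

conditional = [odds_ratio(pi[:, :, k]) for k in range(2)]
marginal = odds_ratio(pi.sum(axis=2))
print(np.allclose(pi, rhs), conditional, marginal)
```

The conditional odds ratios equal 1 at each level of $Z$, while the marginal odds ratio is about 4.5, because both $X$ and $Y$ depend strongly on $Z$.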

9.2.2  Homogeneous  Association  and  Three-Factor  Interaction 

Loglinear models (9.6), (9.8), and (9.10) have three, two, and one pair of conditionally
independent variables, respectively. In the latter two models, the doubly subscripted terms
(such as $\lambda_{ij}^{XY}$) pertain to conditionally dependent variables. A model that permits all three
pairs to be conditionally dependent is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}. \qquad (9.11)$$

From exponentiating both sides, the cell probabilities have form

$$\pi_{ijk} = \psi_{ij}\phi_{jk}\omega_{ik}.$$

No closed-form expression exists for the three components in terms of margins of $\{\pi_{ijk}\}$
except in certain special cases (see Note 10.2).

For this model, in the next section we show that conditional odds ratios between any
two variables are identical at each category of the third variable. That is, each pair has
homogeneous association, as first defined for $2 \times 2 \times K$ tables in Section 2.3.5. Model (9.11)
is called the loglinear model of homogeneous association or of no three-factor interaction.
The general loglinear model for a three-way table is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}. \qquad (9.12)$$

With indicator variables, $\lambda_{ijk}^{XYZ}$ is the coefficient of the product of the $i$th indicator variable
for $X$, $j$th indicator variable for $Y$, and $k$th indicator variable for $Z$. The total number of
nonredundant parameters is

$$1 + (I - 1) + (J - 1) + (K - 1) + (I - 1)(J - 1) + (I - 1)(K - 1) + (J - 1)(K - 1) + (I - 1)(J - 1)(K - 1) = IJK,$$

which is the total number of cell counts. This model has as many parameters as observations
and is saturated. It describes all possible $\{\mu_{ijk} > 0\}$. Each pair of variables may be
conditionally dependent, and an odds ratio for any pair may vary across categories of the
third variable.

Table 9.2  Loglinear Models for Three-Dimensional Tables

Loglinear Model Formula                                                                                                                                         Symbol
$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$                                                                                            (X, Y, Z)
$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$                                                                        (XY, Z)
$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ}$                                                    (XY, YZ)
$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ}$                                (XY, YZ, XZ)
$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ} + \lambda_{ijk}^{XYZ}$          (XYZ)

Setting certain parameters equal to zero in (9.12) yields the models introduced previously.
Table 9.2 lists some of these models. To ease referring to models, this table assigns to each
model a symbol that lists the highest-order term(s) for each variable. For instance, the
model (9.10) of conditional independence between $X$ and $Y$ has symbol (XZ, YZ), since
its highest-order terms are $\lambda_{ik}^{XZ}$ and $\lambda_{jk}^{YZ}$. In the notation we used for logistic models in
Sections 6.1 and 8.1.2 this stands for (X*Z + Y*Z), which is itself shorthand for notation
[X + Y + Z + (X × Z) + (Y × Z)] that has the main effects as well as two interactions.
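
The model symbols also determine the parameter count, since a hierarchical model contains every lower-order term implied by its highest-order terms. The following sketch expands a symbol such as (XZ, YZ) into its terms and counts nonredundant parameters and residual df. The function names and the string encoding of symbols are our own conventions for illustration.

```python
from itertools import combinations

def n_parameters(generators, levels):
    """Count nonredundant parameters of a hierarchical loglinear model.

    generators: highest-order terms, e.g. ["XZ", "YZ"] for symbol (XZ, YZ)
    levels: number of categories per variable, e.g. {"X": 2, "Y": 2, "Z": 2}
    """
    terms = set()
    for g in generators:
        # A hierarchical model includes every nonempty subset of each generator
        for r in range(1, len(g) + 1):
            for subset in combinations(g, r):
                terms.add(frozenset(subset))
    total = 1  # the intercept lambda
    for term in terms:
        count = 1
        for v in term:
            count *= levels[v] - 1
        total += count
    return total

def residual_df(generators, levels):
    """Number of cells minus number of nonredundant parameters."""
    cells = 1
    for size in levels.values():
        cells *= size
    return cells - n_parameters(generators, levels)

levels = {"X": 2, "Y": 2, "Z": 2}
for symbol in (["X", "Y", "Z"], ["XY", "Z"], ["XZ", "YZ"],
               ["XY", "XZ", "YZ"], ["XYZ"]):
    print(symbol, n_parameters(symbol, levels), residual_df(symbol, levels))
```

For a $2 \times 2 \times 2$ table the five models listed have 4, 5, 6, 7, and 8 parameters, so the saturated model (XYZ) leaves residual df = 0.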


9.2.3  Interpretation  of  Loglinear  Model  Parameters 

Interpretations of loglinear model parameters use their highest-order terms. For instance,
interpretations for model (9.11) use the two-factor terms to describe conditional odds ratios.
At a fixed level $k$ of $Z$, the conditional association between $X$ and $Y$ uses $(I - 1)(J - 1)$
odds ratios, such as the local odds ratios

$$\theta_{ij(k)} = \frac{\pi_{ijk}\,\pi_{i+1,j+1,k}}{\pi_{i,j+1,k}\,\pi_{i+1,j,k}}, \quad 1 \le i \le I - 1, \quad 1 \le j \le J - 1. \qquad (9.13)$$

Similarly, $(I - 1)(K - 1)$ odds ratios $\{\theta_{i(j)k}\}$ describe XZ conditional association, and
$(J - 1)(K - 1)$ odds ratios $\{\theta_{(i)jk}\}$ describe YZ conditional association. Loglinear models
have characterizations using constraints on conditional odds ratios. For instance, conditional
independence of $X$ and $Y$ is equivalent to $\{\theta_{ij(k)} = 1, i = 1, \ldots, I - 1, j = 1, \ldots, J - 1,
k = 1, \ldots, K\}$.

The two-factor parameters relate directly to the conditional odds ratios. To illustrate,
substituting (9.11) for model (XY, XZ, YZ) into $\log \theta_{ij(k)}$ yields

$$\log \theta_{ij(k)} = \log \frac{\mu_{ijk}\,\mu_{i+1,j+1,k}}{\mu_{i+1,j,k}\,\mu_{i,j+1,k}} = \lambda_{ij}^{XY} + \lambda_{i+1,j+1}^{XY} - \lambda_{i,j+1}^{XY} - \lambda_{i+1,j}^{XY}. \qquad (9.14)$$


Since the right-hand side is the same for all $k$, an absence of three-factor interaction is
equivalent to

$$\theta_{ij(1)} = \theta_{ij(2)} = \cdots = \theta_{ij(K)} \quad \text{for all } i \text{ and } j.$$

The same argument for the other conditional odds ratios shows that model (XY, XZ, YZ) is
also equivalent to

$$\theta_{i(1)k} = \theta_{i(2)k} = \cdots = \theta_{i(J)k} \quad \text{for all } i \text{ and } k,$$

and to

$$\theta_{(1)jk} = \theta_{(2)jk} = \cdots = \theta_{(I)jk} \quad \text{for all } j \text{ and } k.$$

Any model not having the three-factor interaction term has a homogeneous association for
each pair of variables.

When $X$ and $Y$ have two categories, only one nonredundant $\lambda_{ij}^{XY}$ parameter occurs.
Thus, expression (9.14) simplifies according to the constraints. By the same argument as
in Section 9.1.3 for $2 \times 2$ tables, the conditional log odds ratio simplifies to $\lambda_{11}^{XY}$ with
indicator-variable constraints setting parameters at the second level of $X$ or $Y$ equal to 0.

The $\lambda_{ijk}^{XYZ}$ term in the general model (9.12) refers to three-factor interaction. It describes
how the odds ratio between two variables changes across categories of the third. We illustrate
for $2 \times 2 \times 2$ tables. By direct substitution of the general model formula,

$$\log \frac{\theta_{11(1)}}{\theta_{11(2)}} = \log \frac{(\mu_{111}\mu_{221})/(\mu_{121}\mu_{211})}{(\mu_{112}\mu_{222})/(\mu_{122}\mu_{212})}$$

$$= (\lambda_{111}^{XYZ} + \lambda_{221}^{XYZ} - \lambda_{121}^{XYZ} - \lambda_{211}^{XYZ}) - (\lambda_{112}^{XYZ} + \lambda_{222}^{XYZ} - \lambda_{122}^{XYZ} - \lambda_{212}^{XYZ}).$$

Only one parameter is nonredundant. For constraints setting the second-category parameters
equal to 0, this log ratio of odds ratios equals $\lambda_{111}^{XYZ}$. When $\lambda_{111}^{XYZ} = 0$, $\theta_{11(1)} = \theta_{11(2)}$, giving
homogeneous XY association.
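
To see the three-factor term at work, the sketch below builds expected frequencies for a $2 \times 2 \times 2$ table from the general model (9.12), using indicator variables for the first category of each variable so that second-category parameters equal 0. All parameter values are hypothetical; only the three-factor coefficient should survive in the log ratio of conditional odds ratios.

```python
import numpy as np

def log_ratio_of_odds_ratios(mu):
    """log(theta_11(1) / theta_11(2)) for a 2 x 2 x 2 array of expected counts."""
    lm = np.log(mu)

    def log_or(k):
        # Conditional XY log odds ratio at level k of Z
        return lm[0, 0, k] + lm[1, 1, k] - lm[0, 1, k] - lm[1, 0, k]

    return log_or(0) - log_or(1)

def expected_counts(lam_xyz):
    """mu_ijk from model (9.12) with hypothetical lower-order parameters."""
    ind = np.array([1.0, 0.0])   # indicator of the first category
    x, y, z = np.meshgrid(ind, ind, ind, indexing="ij")
    log_mu = (2.0 + 0.5 * x - 0.3 * y + 0.2 * z
              + 1.1 * x * y + 0.4 * x * z - 0.6 * y * z
              + lam_xyz * x * y * z)
    return np.exp(log_mu)

with_interaction = log_ratio_of_odds_ratios(expected_counts(0.7))
without_interaction = log_ratio_of_odds_ratios(expected_counts(0.0))
print(with_interaction, without_interaction)
```

With the coefficient set to 0.7 the log ratio equals 0.7, and setting it to 0 gives identical conditional odds ratios at the two levels of $Z$.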

9.2.4  Example:  Alcohol,  Cigarette,  and  Marijuana  Use 

Table 9.3 refers to a survey by the Wright State University School of Medicine and the
United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year
of high school in a nonurban area near Dayton, Ohio, whether they had ever used alcohol,
cigarettes, or marijuana. Denote the variables in this $2 \times 2 \times 2$ table by A for alcohol use,
C for cigarette use, and M for marijuana use.

Table 9.3  Alcohol, Cigarette, and Marijuana Use for High School Seniors

Alcohol   Cigarette   Marijuana Use
Use       Use         Yes     No
Yes       Yes         911     538
          No           44     456
No        Yes           3      43
          No            2     279

Source: Data courtesy of Harry Khamis, Wright State University.

Table 9.4  Fitted Values for Loglinear Models Applied to Table 9.3

Alcohol   Cigarette   Marijuana   Loglinear Model
Use (A)   Use (C)     Use (M)     (A, C, M)   (AC, M)   (AM, CM)   (AC, AM, CM)   (ACM)
Yes       Yes         Yes         540.0       611.2     909.24     910.4          911
                      No          740.2       837.8     438.84     538.6          538
          No          Yes         282.1       210.9      45.76      44.6           44
                      No          386.7       289.1     555.16     455.4          456
No        Yes         Yes          90.6        19.4       4.76       3.6            3
                      No          124.2        26.6     142.16      42.4           43
          No          Yes          47.3       118.5       0.24       1.4            2
                      No           64.9       162.5     179.84     279.6          279

Section  9.7  covers  the  fitting  of  loglinear  models.  For  now,  we  emphasize  interpretation. 
Table  9.4  shows  fitted  values  for  several  loglinear  models.  The  fit  for  model  (AC,  AM,  CM) 
is  close  to  the  observed  data,  which  are  the  fitted  values  for  the  saturated  model  (ACM). 
The  other  models  fit  poorly. 

Table 9.5 illustrates model association patterns by presenting estimated conditional and
marginal odds ratios. For example, the entry 1.0 for the AC conditional association for the
model (AM, CM) of AC conditional independence is the common value of the AC fitted
odds ratios at the two levels of M,

$$1.0 = \frac{909.24 \times 0.24}{45.76 \times 4.76} = \frac{438.84 \times 179.84}{555.16 \times 142.16}.$$

The  entry  2.7  for  the  AC  marginal  association  for  this  model  is  the  odds  ratio  for  the 
marginal  AC  fitted  table.  The  odds  ratios  for  the  observed  data  are  those  reported  for  the 
saturated  model  (ACM). 
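
These entries can be reproduced directly from the counts in Table 9.3. The sketch below assumes the closed-form fit for model (AM, CM) that follows from the conditional independence structure (9.9), namely $\hat{\mu}_{acm} = n_{a+m}n_{+cm}/n_{++m}$; model fitting itself is the subject of Section 9.7. The array layout and helper name are ours.

```python
import numpy as np

# Table 9.3 counts indexed [alcohol, cigarette, marijuana], with 0 = yes, 1 = no
n = np.array([[[911.0, 538.0], [44.0, 456.0]],
              [[3.0, 43.0], [2.0, 279.0]]])

def odds_ratio(t):
    """Odds ratio of a 2 x 2 table."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

# Assumed closed-form ML fit for (AM, CM): n_{a+m} n_{+cm} / n_{++m}
n_am = n.sum(axis=1)          # A x M margin
n_cm = n.sum(axis=0)          # C x M margin
n_m = n.sum(axis=(0, 1))      # M margin
fit_am_cm = n_am[:, None, :] * n_cm[None, :, :] / n_m[None, None, :]

# AC odds ratios for the fitted model: conditional on M, then marginal
fitted_conditional_ac = [odds_ratio(fit_am_cm[:, :, m]) for m in range(2)]
fitted_marginal_ac = odds_ratio(fit_am_cm.sum(axis=2))

# AC odds ratios for the observed data, i.e. the saturated model (ACM)
observed_conditional_ac = [odds_ratio(n[:, :, m]) for m in range(2)]
observed_marginal_ac = odds_ratio(n.sum(axis=2))

print(np.round(fit_am_cm, 2))
print(fitted_conditional_ac, fitted_marginal_ac)
print(observed_conditional_ac, observed_marginal_ac)
```

The fitted values agree with the (AM, CM) column of Table 9.4, and the observed conditional AC odds ratios differ between the two levels of M, which is why the saturated model needs one row per level.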

Table  9.5  shows  that  estimated  conditional  odds  ratios  equal  1 .0  for  each  pairwise  term 
not  appearing  in  a  model,  such  as  the  AC  association  in  model  (AM,  CM).  For  that  model, 
the  estimated  marginal  AC  odds  ratio  differs  from  1.0,  since  conditional  independence 
does  not  imply  marginal  independence.  Some  models  have  conditional  associations  that 


Table 9.5 Estimated Odds Ratios for Loglinear Models in Table 9.4

                   Conditional Association     Marginal Association
Model              AC      AM      CM          AC      AM      CM
(A, C, M)          1.0     1.0     1.0         1.0     1.0     1.0
(AC, M)            17.7    1.0     1.0         17.7    1.0     1.0
(AM, CM)           1.0     61.9    25.1        2.7     61.9    25.1
(AC, AM, CM)       7.8     19.8    17.3        17.7    61.9    25.1
(ACM) level 1      13.8    24.3    17.5        17.7    61.9    25.1
(ACM) level 2      7.7     13.5    9.7

are necessarily the same as the corresponding marginal associations. In Section 10.1.3 we
present  a  condition  guaranteeing  this. 

Model  (AC,  AM,  CM)  permits  all  pairwise  associations  but  maintains  homogeneous 
odds  ratios  between  two  variables  at  each  level  of  the  third.  The  AC  fitted  conditional  odds 
ratios for this model equal 7.8. We can calculate this odds ratio using the model’s fitted
values at either level of M, or [from (9.14)] using $\exp(\hat\lambda_{11}^{AC} + \hat\lambda_{22}^{AC} - \hat\lambda_{12}^{AC} - \hat\lambda_{21}^{AC})$.

Table  9.5  shows  that  estimated  odds  ratios  are  highly  dependent  on  the  model.  So,  good 
model  selection  is  crucial.  An  estimate  from  this  table  is  informative  only  to  the  extent  that 
its  model  fits  well.  In  the  next  section  we  discuss  goodness  of  fit. 

9.3  INFERENCE  FOR  LOGLINEAR  MODELS 

A  good-fitting  loglinear  model  provides  a  basis  for  describing  and  making  inferences  about 
associations  among  categorical  responses.  Standard  methods  apply  for  checking  fit  and 
making  inference  about  model  parameters. 

9.3.1  Chi-Squared  Goodness-of-Fit  Tests 

As usual, $X^2$ and $G^2$ test whether a model holds by comparing cell fitted values to observed
counts.  For  loglinear  models,  df  equals  the  number  of  cell  counts  minus  the  number  of 
model  parameters. 

For  the  student  survey  data  (Table  9.3),  Table  9.6  shows  results  of  testing  fit  for  several 
models.  Models  that  lack  any  association  term  fit  poorly.  The  model  (AC,  AM,  CM)  that  has 
all  pairwise  associations  fits  well  (P  =  0.54).  It  is  suggested  by  other  criteria  also,  such  as 
minimizing AIC (Section 6.1.6).

9.3.2  Inference  about  Conditional  Associations 

Tests about conditional associations compare loglinear models. The likelihood-ratio statistic
$-2(L_0 - L_1)$ is identical to the difference $G^2(M_0 \mid M_1) = G^2(M_0) - G^2(M_1)$ between
deviances for models without that term and with it. For model (XY, XZ, YZ), consider the
hypothesis of XY conditional independence. This is $H_0\colon \lambda_{ij}^{XY} = 0$ for the $(I - 1)(J - 1)$


Table 9.6 Goodness-of-Fit Tests for Loglinear Models in Table 9.4

Loglinear Model    G^2       X^2       df    P-value(a)    AIC
(A, C, M)          1286.0    1411.4    4     < 0.001       1343.1
(A, CM)            534.2     505.6     3     < 0.001       593.3
(C, AM)            939.6     824.2     3     < 0.001       998.6
(M, AC)            843.8     704.9     3     < 0.001       902.9
(AC, AM)           497.4     443.8     2     < 0.001       558.4
(AC, CM)           92.0      80.8      2     < 0.001       153.1
(AM, CM)           187.8     177.6     2     < 0.001       248.8
(AC, AM, CM)       0.4       0.4       1     0.54          63.4
(ACM)              0.0       0.0       0     -             65.0

(a) P-value for G^2 deviance statistic.

association parameters relating X and Y. The test statistic is $G^2(XZ, YZ) - G^2(XY, XZ, YZ)$,
with df = $(I - 1)(J - 1)$. This has the same purpose as the generalized
CMH  and  model-based  tests  for  nominal  variables  presented  in  Section  8.4. 

For  instance,  the  test  of  conditional  independence  between  alcohol  use  and  cigarette 
smoking  compares  model  (AM,  CM)  with  the  alternative  (AC,  AM,  CM).  The  test 
statistic  is 


\[
G^2[(AM, CM) \mid (AC, AM, CM)] = 187.8 - 0.4 = 187.4,
\]

with df = 2 - 1 = 1 (P < 0.001). The statistics comparing (AC, CM) and (AC, AM) with
(AC,  AM,  CM)  also  provide  strong  evidence  of  AM  and  CM  conditional  associations.  In 
further  analyses  of  the  data,  we  use  model  (AC,  AM,  CM). 
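
The P-values quoted for these tests can be reproduced with a few lines of code. The sketch below relies on the fact that a chi-squared variate with 1 df is the square of a standard normal variate, so its upper-tail probability is erfc(sqrt(x/2)); the helper name `chi2_sf_df1` is ours, and the approach does not extend to other df without a general chi-squared routine.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability P(chi-squared with 1 df > x)."""
    return math.erfc(math.sqrt(x / 2.0))


# Likelihood-ratio test of AC conditional independence:
# deviance of (AM, CM) minus deviance of (AC, AM, CM), from Table 9.6.
lr_stat = 187.8 - 0.4
lr_df = 2 - 1
lr_p = chi2_sf_df1(lr_stat)

# Goodness of fit of (AC, AM, CM): deviance 0.3740 on 1 df (Table 9.7).
fit_p = chi2_sf_df1(0.3740)

print(f"LR statistic = {lr_stat:.1f}, df = {lr_df}, P = {lr_p:.3g}")
print(f"Goodness-of-fit P = {fit_p:.2f}")
```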

With  large  sample  sizes,  statistically  significant  effects  can  be  weak  and  practically 
unimportant.  A  more  relevant  concern  is  whether  the  associations  are  strong  enough  to 
be  of  interest.  Confidence  intervals  are  more  useful  than  tests  for  assessing  this.  Table  9.7 
shows  output  from  fitting  model  (AC,  AM,  CM)  with  parameters  in  the  last  row  and  in  the 
last  column  equal  to  zero,  such  as  by  using  (1,0)  indicator  variables  for  each  classification. 
Consider  the  conditional  AC  odds  ratio,  assuming  model  (AC,  AM,  CM).  Table  9.7  reports 
$\hat\lambda_{11}^{AC} = 2.054$, with SE = 0.174. For these constraints, this is the estimated conditional
log odds ratio. A 95% Wald confidence interval for the true conditional AC odds ratio
is $\exp[2.054 \pm 1.96(0.174)]$, or (5.5, 11.0). Strong positive association exists between
cigarette  use  and  alcohol  use,  for  both  users  and  nonusers  of  marijuana. 
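
The interval arithmetic is easy to verify. This sketch exponentiates the endpoints of the Wald interval for each two-factor estimate reported in Table 9.7; the dictionary layout and function name are illustrative choices, not part of the SAS output.

```python
import math


def wald_ci_odds_ratio(log_odds_ratio, se, z=1.96):
    """Wald interval for an odds ratio, built on the log scale."""
    return (math.exp(log_odds_ratio - z * se),
            math.exp(log_odds_ratio + z * se))


# (estimate, standard error) for the two-factor terms in Table 9.7.
estimates = {
    "AC": (2.0545, 0.1741),
    "AM": (2.9860, 0.4647),
    "CM": (2.8479, 0.1638),
}

intervals = {name: wald_ci_odds_ratio(est, se)
             for name, (est, se) in estimates.items()}

for name, (lower, upper) in intervals.items():
    print(f"{name}: ({lower:.1f}, {upper:.1f})")
```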

For  model  (AC,  AM,  CM),  the  95%  Wald  confidence  intervals  are  (8.0, 49.2)  for  the  AM 
conditional  odds  ratio  and  (12.5,  23.8)  for  the  CM  conditional  odds  ratio.  The  intervals 


Table 9.7 Software Output (Based on SAS) for Homogeneous Association
Model Fitted to Table 9.3

          Criteria For Assessing Goodness Of Fit

Criterion             DF    Value     Value/DF
Deviance              1     0.3740    0.3740
Pearson Chi-Square    1     0.4011    0.4011

                                Standard    Wald
Parameter          Estimate     Error       Chi-Square    Pr > ChiSq
Intercept          5.6334       0.0597      8903.96       <.0001
a      1           0.4877       0.0758      41.44         <.0001
c      1           -1.8867      0.1627      134.47        <.0001
m      1           -5.3090      0.4752      124.82        <.0001
a*m    1  1        2.9860       0.4647      41.29         <.0001
a*c    1  1        2.0545       0.1741      139.32        <.0001
c*m    1  1        2.8479       0.1638      302.14        <.0001

          LR Statistics

Source    DF    Chi-Square    Pr > ChiSq
a*m       1     91.64         <.0001
a*c       1     187.38        <.0001
c*m       1     497.00        <.0001

are wide, but these associations also are strong. Table 9.5 showed that estimated marginal
associations  are  even  stronger.  Controlling  for  outcome  on  one  response  moderates  the 
association  somewhat  between  the  other  two. 

The  analyses  in  this  section  pertain  to  associations.  A  different  analysis  pertains  to 
comparing  single-variable  marginal  distributions,  for  instance,  to  determine  if  students 
used  cigarettes  more  than  alcohol  or  marijuana.  That  type  of  analysis  is  presented  in 
Section  11.1. 


9.4  LOGLINEAR  MODELS  FOR  HIGHER  DIMENSIONS 

Loglinear  models  for  three-way  tables  extend  readily  to  multiway  tables.  As  the  number 
of  dimensions  increases,  some  complications  arise.  One  is  the  increase  in  the  number  of 
possible  association  and  interaction  terms,  making  model  selection  more  difficult.  Another 
is  the  increase  in  number  of  cells.  In  Section  10.6  we  show  that  this  can  cause  difficulties 
with  existence  of  estimates  and  appropriateness  of  some  large-sample  theory. 

9.4.1 Models for Four-Way Contingency Tables

We  illustrate  models  for  higher  dimensions  using  a  four-way  table  with  variables  W,  X,  Y, 
and  Z.  Interpretations  are  simplest  when  the  model  has  no  three-factor  interaction  terms, 
so  that  each  pairwise  association  is  homogeneous.  Such  models  are  special  cases  of 

\[
\log \mu_{hijk} = \lambda + \lambda_h^W + \lambda_i^X + \lambda_j^Y + \lambda_k^Z
  + \lambda_{hi}^{WX} + \lambda_{hj}^{WY} + \lambda_{hk}^{WZ} + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ},
\]

denoted  by  (WX,  WY,  WZ,  XY,  XZ,  YZ).  Each  pair  of  variables  is  conditionally  dependent, 
with  the  same  odds  ratios  at  each  combination  of  categories  of  the  other  two  variables. 
An  absence  of  a  two-factor  term  implies  conditional  independence,  given  the  other  two 
variables. 

A  variety  of  models  exhibit  three-factor  interaction.  A  model  could  contain  any  of 
WXY,  WXZ,  WYZ,  or  XYZ  terms.  For  model  (WXY,  WZ,  XZ,  YZ),  each  pair  of  variables  is 
conditionally  dependent,  but  at  each  level  of  Z  the  WX  association,  the  WY  association,  and 
the  XY  association  may  vary  across  categories  of  the  remaining  variable.  The  conditional 
association  between  Z  and  another  variable  is  homogeneous.  The  saturated  model  contains 
all  the  three-factor  terms  plus  a  four-factor  interaction  term. 

9.4.2  Example:  Automobile  Accidents  and  Seat-Belt  Use 

Table  9.8  summarizes  observations  of  68,694  passengers  in  autos  and  light  trucks  involved 
in  accidents  one  year  in  the  state  of  Maine.  The  table  classifies  passengers  by  gender  (G), 
location of accident (L), seat-belt use (S), and injury (I). Table 9.8 reports the sample
proportion  of  passengers  who  were  injured.  For  each  GL  combination,  the  proportion  of 
injuries  was  about  halved  for  passengers  wearing  seat  belts. 

Table  9.9  displays  tests  of  fit  for  several  loglinear  models.  To  investigate  the  complexity 
of model needed, we consider models (G, I, L, S), (GI, GL, GS, IL, IS, LS), and (GIL, GIS,
GLS, ILS), which contain all the single-factor, all the two-factor, and all the three-factor
terms, respectively. Model (G, I, L, S) of mutual independence

Table 9.8 Loglinear Models for Injury, Seat-Belt Use, Gender, and Location(a)

                              Observed           (GI, GL, GS, IL, IS, LS)   (GLS, GI, IL, IS)      Sample
                   Seat       Injury             Injury                     Injury                 Proportion
Gender   Location  Belt Use   No        Yes      No          Yes            No          Yes        Yes
Female   Urban     No         7,287     996      7,166.4     993.0          7,273.2     1,009.8    0.12
                   Yes        11,587    759      11,748.3    721.3          11,632.6    713.4      0.06
         Rural     No         3,246     973      3,353.8     988.8          3,254.7     964.3      0.23
                   Yes        6,134     757      5,985.5     781.9          6,093.5     797.5      0.11
Male     Urban     No         10,381    812      10,471.5    845.1          10,358.9    834.1      0.07
                   Yes        10,969    380      10,837.8    387.6          10,959.2    389.8      0.03
         Rural     No         6,123     1,084    6,045.3     1,038.1        6,150.2     1,056.8    0.15
                   Yes        6,693     513      6,811.4     518.2          6,697.6     508.4      0.07

(a) G, gender; I, injury; L, location; S, seat-belt use.

Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine.


fits very poorly. Model (GI, GL, GS, IL, IS, LS) fits much better but still has a lack of fit
(P < 0.001). Model (GIL, GIS, GLS, ILS) fits well ($G^2 = 1.33$, df = 1) but is complex and
difficult to interpret. This suggests studying models more complex than (GI, GL, GS, IL,
IS, LS) but simpler than (GIL, GIS, GLS, ILS).

First,  however,  we  analyze  model  (GI,  GL,  GS,  IL,  IS,  LS),  which  focuses  on  pairwise 
associations. Table 9.8 displays its fitted values. Table 9.10 reports the model-based estimated
conditional odds ratios. We can obtain them directly from parameter estimates; for
instance, $0.44 = \exp(\hat\lambda_{11}^{IS} + \hat\lambda_{22}^{IS} - \hat\lambda_{12}^{IS} - \hat\lambda_{21}^{IS})$.

Since  the  sample  size  is  large,  the  estimates  of  odds  ratios  are  quite  precise.  For  instance, 
the standard error of the estimated IS conditional log odds ratio of -0.814 is 0.028. A 95%
Wald confidence interval for the true odds ratio is $\exp[-0.814 \pm 1.96(0.028)]$, or (0.42,
0.47). This model estimates that the odds of injury for passengers wearing seat belts
were less than half the odds for passengers not wearing them, at each gender-by-location
combination. The fitted odds ratios in Table 9.10 also suggest that other factors being fixed,
injury  was  more  likely  in  rural  than  urban  accidents  and  more  likely  for  females  than  for 
males.  The  estimated  odds  that  males  used  seat  belts  were  0.63  times  the  estimated  odds 
for  females,  conditional  on  each  combination  of  I  and  L  categories. 

Interpretations  are  more  complex  for  models  containing  three-factor  interaction  terms. 
Table  9.9  shows  results  of  adding  a  single  three-factor  term  to  model  (GI,  GL,  GS,  IL,  IS, 


Table 9.9 Goodness-of-Fit Tests for Loglinear Models Fitted to Table 9.8

Model                        G^2        df    P-Value    AIC
(G, I, L, S)                 2792.77    11    <0.0001    2956.2
(GI, GL, GS, IL, IS, LS)     23.35      5     <0.001     198.8
(GIL, GIS, GLS, ILS)         1.33       1     0.25       184.8
(GIL, GS, IS, LS)            18.57      4     0.001      196.0
(GIS, GL, IL, LS)            22.85      4     <0.001     200.3
(GLS, GI, IL, IS)            7.46       4     0.11       184.9
(ILS, GI, GL, GS)            20.63      4     <0.001     198.1

Table 9.10 Estimated Conditional Odds Ratios for Models for Table 9.8

                              Loglinear Model
Odds Ratio                    (GI, GL, GS, IL, IS, LS)    (GLS, GI, IL, IS)
GI                            0.58                        0.58
IL                            2.13                        2.13
IS                            0.44                        0.44
GL  (S = no, yes)             (1.23, 1.23)                (1.33, 1.17)
GS  (L = urban, rural)        (0.63, 0.63)                (0.66, 0.58)
LS  (G = female, male)        (1.09, 1.09)                (1.17, 1.03)


LS). Of the four possible models, (GLS, GI, IL, IS) fits best and has AIC essentially the
same as the model (GIL, GIS, GLS, ILS). Table 9.8 also displays its fit. Considering that
the sample size is very large, its $G^2$ value suggests that it fits quite well. For this model,
each pair of variables is conditionally dependent, and at each category of I the association
between any two of the others varies across categories of the remaining variable. For this
model, it is inappropriate to interpret the GL, GS, and LS two-factor terms on their own.
Since I does not occur in a three-factor interaction, the conditional odds ratio between I
and each variable (see the top-right portion of Table 9.10) is the same at each combination
of categories of the other two variables.

When  a  model  has  a  three-factor  interaction  term  but  no  term  of  higher  order  than  that, 
we  can  study  the  interaction  by  calculating  fitted  odds  ratios  between  two  variables  at  each 
level  of  the  third.  We  can  do  this  at  any  levels  of  remaining  variables  not  involved  in  the 
interaction.  The  bottom-right  portion  of  Table  9.10  illustrates  this  for  model  (GLS,  GI,  IL, 
IS). For instance, the fitted GS odds ratio of 0.66 for (L = urban) refers to four fitted values
for urban accidents, both the four with (injury = no) and the four with (injury = yes); for
example, $0.66 = (7273.2 \times 10{,}959.2)/(11{,}632.6 \times 10{,}358.9)$.
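
To see that the fitted GS odds ratio is indeed the same at both injury levels but differs by location, one can recompute it from the fitted values of model (GLS, GI, IL, IS) in Table 9.8. The nested-dictionary layout and the function name below are illustrative assumptions.

```python
# Fitted values for model (GLS, GI, IL, IS) from Table 9.8,
# keyed by location, then (gender, seat-belt use), then injury.
fitted = {
    "urban": {
        ("female", "no"): {"no": 7273.2, "yes": 1009.8},
        ("female", "yes"): {"no": 11632.6, "yes": 713.4},
        ("male", "no"): {"no": 10358.9, "yes": 834.1},
        ("male", "yes"): {"no": 10959.2, "yes": 389.8},
    },
    "rural": {
        ("female", "no"): {"no": 3254.7, "yes": 964.3},
        ("female", "yes"): {"no": 6093.5, "yes": 797.5},
        ("male", "no"): {"no": 6150.2, "yes": 1056.8},
        ("male", "yes"): {"no": 6697.6, "yes": 508.4},
    },
}


def gs_odds_ratio(location, injury):
    """Fitted gender-by-seat-belt odds ratio at one (location, injury) level."""
    cells = fitted[location]
    numerator = cells[("female", "no")][injury] * cells[("male", "yes")][injury]
    denominator = cells[("female", "yes")][injury] * cells[("male", "no")][injury]
    return numerator / denominator


gs = {(loc, inj): gs_odds_ratio(loc, inj)
      for loc in ("urban", "rural") for inj in ("no", "yes")}
for key, value in gs.items():
    print(key, round(value, 2))
```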

9.4.3  Large  Samples  and  Statistical  Versus  Practical  Significance 

Model  (GLS,  GI,  IL,  IS)  seems  to  fit  much  better  than  (GI,  GL,  GS,  IL,  IS,  LS).  The  difference 
in $G^2$ values of 23.4 - 7.5 = 15.9 has df = 5 - 4 = 1 (P = 0.0001). Table 9.10 indicates,
however,  that  the  degree  of  three-factor  interaction  is  weak.  The  fitted  odds  ratio  between 
any  two  of  G,  L,  and  S  is  similar  at  both  levels  of  the  third  variable.  The  significantly  better 
fit  of  model  (GLS,  GI,  IL,  IS)  reflects  mainly  the  enormous  sample  size. 

As  in  any  test,  a  statistically  significant  effect  need  not  be  practically  important.  With 
huge  samples,  it  is  better  to  focus  on  estimation  rather  than  hypothesis  testing.  For  instance, 
a comparison of fitted odds ratios for the two models in Table 9.10 suggests that the simpler
model  (GI,  GL,  GS,  IL,  IS,  LS)  is  adequate  for  most  purposes. 

9.4.4  Dissimilarity  Index 

For a table of arbitrary dimension with cell counts $\{n_i = n p_i\}$ and fitted values $\{\hat\mu_i = n\hat\pi_i\}$,
the dissimilarity index (Gini 1914b)

\[
\hat\Delta = \sum_i |n_i - \hat\mu_i| / 2n = \sum_i |p_i - \hat\pi_i| / 2
\]

summarizes how far the model fit falls from the data. This index falls between 0 and 1,
with larger values representing a poorer fit. It represents the proportion of sample cases that
must move to different cells for the model to fit perfectly.

The dissimilarity index $\hat\Delta$ estimates a corresponding population index $\Delta$ describing
model lack of fit. The value $\Delta = 0$ occurs when the model holds perfectly. In practice, this
is unrealistic for unsaturated models, and $\Delta > 0$. The estimator $\hat\Delta$ helps study whether the
lack of fit is important in a practical sense. When $\hat\Delta < 0.02$ or 0.03, the sample data follow
the model pattern quite closely, even though the model is not perfect.

For Table 9.8, model (GI, GL, GS, IL, IS, LS) has $\hat\Delta = 0.008$, and model (GLS, GI, IL,
IS) has $\hat\Delta = 0.003$. For either model, moving less than 1% of the data yields a perfect fit.
The relatively large $G^2$ value for (GI, GL, GS, IL, IS, LS) indicated that it does not truly
hold. Nevertheless, the small $\hat\Delta$ value suggests that, in practical terms, it fits decently.
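
The two index values can be recovered directly from the observed counts and fitted values printed in Table 9.8. The sketch below lists the 16 cells in the row order of that table (injury = no, then injury = yes within each row); the function name is our own.

```python
# Observed counts and fitted values from Table 9.8, in table row order.
observed = [7287, 996, 11587, 759, 3246, 973, 6134, 757,
            10381, 812, 10969, 380, 6123, 1084, 6693, 513]
fitted_pairwise = [7166.4, 993.0, 11748.3, 721.3, 3353.8, 988.8, 5985.5, 781.9,
                   10471.5, 845.1, 10837.8, 387.6, 6045.3, 1038.1, 6811.4, 518.2]
fitted_gls = [7273.2, 1009.8, 11632.6, 713.4, 3254.7, 964.3, 6093.5, 797.5,
              10358.9, 834.1, 10959.2, 389.8, 6150.2, 1056.8, 6697.6, 508.4]


def dissimilarity_index(counts, fitted_values):
    """Sum of absolute deviations between counts and fit, divided by 2n."""
    n = sum(counts)
    return sum(abs(c - f) for c, f in zip(counts, fitted_values)) / (2 * n)


delta_pairwise = dissimilarity_index(observed, fitted_pairwise)
delta_gls = dissimilarity_index(observed, fitted_gls)
print(round(delta_pairwise, 3), round(delta_gls, 3))
```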

When $\Delta$ is near 0, $\hat\Delta$ tends to overestimate $\Delta$, substantially so for small n. Kuha and Firth
(2011) provided an approximate variance for $\hat\Delta$ and studied ways to reduce its estimation
bias.


9.5 LOGLINEAR-LOGISTIC MODEL CONNECTION

Loglinear models treat categorical response variables symmetrically, focusing on associations
and interactions in their joint distribution. Logistic models, by contrast, describe
how  a  single  categorical  response  depends  on  explanatory  variables.  The  model  types 
seem  distinct,  but  connections  exist  between  them.  For  a  loglinear  model,  forming  logits 
on  one  response  helps  to  interpret  the  model.  Moreover,  logistic  models  with  categorical 
explanatory  variables  have  equivalent  loglinear  models  (Bishop  1969). 

9.5.1  Using  Logistic  Models  to  Interpret  Loglinear  Models 

To  understand  implications  of  a  loglinear  model  formula,  it  can  help  to  form  a  logit  on  one 
variable.  We  illustrate  with  the  loglinear  model  (XY,  XZ,  YZ).  When  Y  is  binary,  its  logit  is 

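\[
\begin{aligned}
\log\frac{P(Y = 1 \mid X = i, Z = k)}{P(Y = 2 \mid X = i, Z = k)}
  &= \log\frac{\mu_{i1k}}{\mu_{i2k}} = \log\mu_{i1k} - \log\mu_{i2k} \\
  &= (\lambda + \lambda_i^X + \lambda_1^Y + \lambda_k^Z + \lambda_{i1}^{XY} + \lambda_{ik}^{XZ} + \lambda_{1k}^{YZ}) \\
  &\quad - (\lambda + \lambda_i^X + \lambda_2^Y + \lambda_k^Z + \lambda_{i2}^{XY} + \lambda_{ik}^{XZ} + \lambda_{2k}^{YZ}) \\
  &= (\lambda_1^Y - \lambda_2^Y) + (\lambda_{i1}^{XY} - \lambda_{i2}^{XY}) + (\lambda_{1k}^{YZ} - \lambda_{2k}^{YZ}).
\end{aligned}
\]

The first parenthetical term is a constant, not depending on i or k. The second parenthetical
term depends on the category i of X. The third parenthetical term depends on the category
k of Z. This logit has the additive form

\[
\text{logit}[P(Y = 1 \mid X = i, Z = k)] = \alpha + \beta_i^X + \beta_k^Z. \tag{9.15}
\]

Using the notation summarizing logistic models by their predictors, we denote it by
(X + Z).
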
In Section 5.4.1 we discussed this logistic model. When Y is binary, the loglinear model
(XY, XZ, YZ) is equivalent to it. The $\lambda_{ik}^{XZ}$ terms for association among explanatory variables
cancel in the difference in logarithms that the logit defines. The logistic model does not
study this association.

9.5.2  Example:  Auto  Accidents  and  Seat-Belts  Revisited 

For  the  Maine  auto  accidents  (Table  9.8),  in  Section  9.4.2  we  showed  that  the  loglinear 
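model (GLS, GI, IL, IS),

\[
\log\mu_{gi\ell s} = \lambda + \lambda_g^G + \lambda_i^I + \lambda_\ell^L + \lambda_s^S + \lambda_{gi}^{GI} + \lambda_{g\ell}^{GL} + \lambda_{gs}^{GS}
  + \lambda_{i\ell}^{IL} + \lambda_{is}^{IS} + \lambda_{\ell s}^{LS} + \lambda_{g\ell s}^{GLS},
\]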

fits well. It is natural to treat injury (I) as a response variable and gender (G), location
(L), and seat-belt use (S) as explanatory variables, or perhaps S as a response with G and
L as explanatory. One can show that this loglinear model is equivalent to logistic model
(G + L + S),

\[
\text{logit}[P(I = 1 \mid G = g, L = \ell, S = s)] = \alpha + \beta_g^G + \beta_\ell^L + \beta_s^S. \tag{9.16}
\]

For instance, the seat-belt effects in the two models satisfy $\beta_s^S = \lambda_{1s}^{IS} - \lambda_{2s}^{IS}$. In the logit
calculation, all terms in the loglinear model not having the injury index i cancel. Fitted
values, goodness-of-fit statistics, residual df, and standardized residuals for the logistic
model are identical to those for the loglinear model.

Odds ratios describing effects on I relate to two-factor loglinear parameters and main-effect
logistic parameters. In the logistic model, the log odds ratio for the effect of S on I
equals $\beta_1^S - \beta_2^S$. This equals $\lambda_{11}^{IS} + \lambda_{22}^{IS} - \lambda_{12}^{IS} - \lambda_{21}^{IS}$ in the loglinear model. Their estimates
are the same no matter how software sets up constraints. For Table 9.8, $\hat\beta_1^S - \hat\beta_2^S = -0.817$
for the logistic model, and $\hat\lambda_{11}^{IS} + \hat\lambda_{22}^{IS} - \hat\lambda_{12}^{IS} - \hat\lambda_{21}^{IS} = -0.817$ for the loglinear model.

Loglinear  models  are  GLMs  that  treat  the  16  cell  counts  in  Table  9.8  as  16  independent 
Poisson  variates.  Logistic  models  are  GLMs  that  treat  the  table  as  binomial  counts.  Logistic 
models with I as the response treat the marginal GLS table $\{n_{g+\ell s}\}$ as fixed and regard $\{n_{g1\ell s}\}$
as eight independent binomial variates on that response. Although the sampling models
differ,  the  results  from  fits  of  corresponding  models  are  identical. 

9.5.3  Equivalent  Loglinear  and  Logistic  Models 

In the derivation of the logistic model (X + Z) [see (9.15)] from loglinear model (XY,
XZ, YZ), the $\lambda^{XZ}$ term cancels. It might seem as if the model (XY, YZ) omitting this
term is also equivalent to that logistic model. Indeed, forming the logit on Y for (XY,
YZ) results in the same logistic formula. The loglinear model that has the same fit as the
logistic  model,  however,  contains  a  general  interaction  term  for  relationships  among  the 
explanatory  variables.  The  logistic  model  does  not  assume  anything  about  relationships 
among  explanatory  variables,  so  it  allows  an  arbitrary  interaction  pattern  for  them. 

Table  9.11  summarizes  equivalent  logistic  and  loglinear  models  for  three-way  tables 
when  Y  is  a  binary  response.  Each  loglinear  model  contains  the  XZ  association  term 
relating  the  explanatory  variables  in  the  logistic  models.  The  saturated  loglinear  model 

Table 9.11 Equivalent Loglinear and Logistic Models for a
Three-Way Table with Binary Response Variable Y

Loglinear Symbol    Logistic Model                                                Logistic Symbol
(Y, XZ)             $\alpha$                                                      (-)
(XY, XZ)            $\alpha + \beta_i^X$                                          (X)
(YZ, XZ)            $\alpha + \beta_k^Z$                                          (Z)
(XY, YZ, XZ)        $\alpha + \beta_i^X + \beta_k^Z$                              (X + Z)
(XYZ)               $\alpha + \beta_i^X + \beta_k^Z + \beta_{ik}^{XZ}$            (X*Z)


(XYZ)  is  equivalent  to  a  logistic  model  with  an  interaction  between  the  predictors  X  and 
Z.  For  instance,  the  effect  of  X  on  Y  depends  on  Z,  meaning  that  the  XY  odds  ratio  varies 
across  its  categories.  That  logistic  model  is  also  saturated.  Analogous  correspondences  hold 
when  Y  has  several  categories,  using  baseline-category  logit  models.  An  advantage  of  the 
loglinear  approach  is  its  generality.  It  applies  when  more  than  one  response  variable  exists. 
The  alcohol-cigarette-marijuana  example  in  Section  9.2.4,  for  instance,  used  loglinear 
models  to  study  association  patterns  among  three  response  variables. 

9.5.4  Example:  Detecting  Gene-Environment  Interactions  in  Case-Control  Studies 

Considerable  research  in  recent  years  has  focused  on  the  role  that  gene-environment 
interactions  may  play  in  complex  diseases.  An  environmental  exposure  can  markedly 
increase  the  risk  of  a  disease  in  a  genetically  susceptible  subgroup  but  have  little  or  no 
effect  for  others  (Umbach  and  Weinberg  1997).  Case-control  studies  are  often  used  to 
investigate  such  interactions. 

For  a  disease  outcome  Y,  binary  genetic  factor  G  (such  as  1  for  the  variant  genotype  and 
0  for  the  “wild  type”)  and  binary  environmental  factor  E,  consider  the  model 

\[
\text{logit}[P(Y = 1 \mid G = g, E = e)] = \alpha + \beta_1 g + \beta_2 e + \beta_3 ge,
\]

where g and e each take values 0 and 1. The corresponding loglinear model is

\[
\log\mu_{gey} = \alpha_0 + \beta_0 g + \gamma_0 e + \lambda_0 ge + \alpha y + \beta_1 gy + \beta_2 ey + \beta_3 gey,
\]

where $\alpha$, $\beta_1$, $\beta_2$, and $\beta_3$ are the same in each model. Without further restrictions, each
model is saturated. For this parameterization, the log odds ratio between G and E is $\lambda_0$ when
y = 0 and $(\lambda_0 + \beta_3)$ when y = 1. In case-control studies, Piegorsch et al. (1994) noted
that it is often reasonable to assume that $\lambda_0 = 0$, that is, that genotype and environmental
exposure are independent in the control population. This assumption corresponds to an
unusual instance in which a nonhierarchical model makes sense biologically. The ML
estimate of the interaction, $\exp(\hat\beta_3) = (n_{111} n_{001})/(n_{101} n_{011})$ with counts indexed as $n_{gey}$,
then depends only on the counts for the cases.
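
A small numerical sketch may help fix ideas. The counts below are entirely hypothetical (they are not from any study), chosen so that G and E are exactly independent among controls; in that situation the case-only estimate of the interaction odds ratio coincides with the estimate from the saturated model, which is the ratio of the GE odds ratio among cases to that among controls.

```python
# Hypothetical case-control counts n[(g, e, y)]: g = genotype, e = exposure,
# y = 1 for cases and 0 for controls. Illustrative numbers only.
counts = {
    (1, 1, 1): 30, (1, 0, 1): 40, (0, 1, 1): 50, (0, 0, 1): 100,  # cases
    (1, 1, 0): 10, (1, 0, 0): 20, (0, 1, 0): 50, (0, 0, 0): 100,  # controls
}


def ge_odds_ratio(n, y):
    """Gene-environment odds ratio within disease status y."""
    return (n[(1, 1, y)] * n[(0, 0, y)]) / (n[(1, 0, y)] * n[(0, 1, y)])


# Case-only estimate of exp(beta_3), valid under the assumption lambda_0 = 0.
case_only = ge_odds_ratio(counts, y=1)

# Saturated-model estimate of exp(beta_3): cases versus controls.
case_control = ge_odds_ratio(counts, y=1) / ge_odds_ratio(counts, y=0)

print(case_only, case_control)
```

With real data the control odds ratio would rarely equal 1 exactly, and the two estimates would differ.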

Piegorsch  et  al.  (1994)  noted  that  such  a  case-only  analysis  can  provide  a  more  efficient 
estimate  of  the  interaction  than  obtained  using  all  the  data.  However,  results  are  biased 
if  the  independence  assumption  is  violated.  Umbach  and  Weinberg  (1997)  and  Li  and 
Conti  (2009)  discussed  this  issue,  the  latter  article  using  Bayesian  model  averaging  to 
combine  results  from  case-only  and  case-control  analyses  to  reduce  potential  bias.  Further 
complicating issues are that the scale on which gene-environment interaction is most
pronounced  may  not  be  the  one  used  in  standard  models,  and  a  statistical  interaction  may 
not  have  a  biological  interpretation. 


9.6  LOGLINEAR  MODEL  FITTING:  LIKELIHOOD  EQUATIONS  AND 
ASYMPTOTIC  DISTRIBUTIONS 

We next discuss loglinear model fitting. After deriving sufficient statistics and likelihood
equations, we present large-sample normal distributions for ML estimators of model parameters
and cell probabilities. We illustrate results with models for three-way tables.
For simplicity, derivations use the Poisson sampling model, which does not require the
constraint on $\{\mu_{ijk}\}$ that the multinomial has.

9.6.1  Minimal  Sufficient  Statistics 

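For three-way tables, the joint Poisson probability that cell counts $\{Y_{ijk} = n_{ijk}\}$ is

\[
\prod_i \prod_j \prod_k \frac{e^{-\mu_{ijk}}\,\mu_{ijk}^{n_{ijk}}}{n_{ijk}!},
\]

with product taken over all cells of the table. The kernel of the log likelihood is

\[
L(\mu) = \sum_i \sum_j \sum_k n_{ijk} \log \mu_{ijk} - \sum_i \sum_j \sum_k \mu_{ijk}. \tag{9.17}
\]

For the general loglinear model (9.12), this simplifies to

\[
\begin{aligned}
L(\mu) = {} & n\lambda + \sum_i n_{i++}\lambda_i^X + \sum_j n_{+j+}\lambda_j^Y + \sum_k n_{++k}\lambda_k^Z \\
 & + \sum_i \sum_j n_{ij+}\lambda_{ij}^{XY} + \sum_i \sum_k n_{i+k}\lambda_{ik}^{XZ} + \sum_j \sum_k n_{+jk}\lambda_{jk}^{YZ} \\
 & + \sum_i \sum_j \sum_k n_{ijk}\lambda_{ijk}^{XYZ} - \sum_i \sum_j \sum_k \exp(\lambda + \cdots + \lambda_{ijk}^{XYZ}).
\end{aligned} \tag{9.18}
\]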

Since the Poisson distribution is in the exponential family, coefficients of the parameters
are sufficient statistics. For this saturated model, $\{n_{ijk}\}$ are coefficients of $\{\lambda_{ijk}^{XYZ}\}$, so there
is no reduction of the data. For simpler models, certain parameters are zero and (9.18)
simplifies. For instance, for the model (X, Y, Z) of mutual independence, sufficient statistics
are the coefficients in (9.18) of $\{\lambda_i^X\}$, $\{\lambda_j^Y\}$, and $\{\lambda_k^Z\}$. These are $\{n_{i++}\}$, $\{n_{+j+}\}$, and $\{n_{++k}\}$.
Table 9.12 lists minimal sufficient statistics for several loglinear models. Each one is the
coefficient of the highest-order term(s) in which a variable appears. We see that they are the
marginal distributions for terms in the model symbol. Simpler models use more condensed
sample information. For instance, whereas (X, Y, Z) uses only the single-factor marginal
distributions, (XY, XZ, YZ) uses the two-way marginal tables.

Table 9.12 Minimal Sufficient Statistics for
Loglinear Models in Three-Way Tables

Model            Minimal Sufficient Statistics
(X, Y, Z)        $\{n_{i++}\}$, $\{n_{+j+}\}$, $\{n_{++k}\}$
(XY, Z)          $\{n_{ij+}\}$, $\{n_{++k}\}$
(XY, YZ)         $\{n_{ij+}\}$, $\{n_{+jk}\}$
(XY, XZ, YZ)     $\{n_{ij+}\}$, $\{n_{i+k}\}$, $\{n_{+jk}\}$

9.6.2  Likelihood  Equations  for  Loglinear  Models 

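The fitted values for a model are solutions to the likelihood equations. We derive likelihood
equations in terms of a general formula for a loglinear model. Let $n = (n_1, \ldots, n_N)^T$ and
$\mu = (\mu_1, \ldots, \mu_N)^T$ denote column vectors of observed and expected counts for the N cells
of a contingency table, with $n = \sum_i n_i$. For simplicity we use a single index, but the table
may be multidimensional. Loglinear models for positive Poisson means have the form

\[
\log \mu = X\beta \tag{9.19}
\]

for a model matrix X and column vector $\beta$ of model parameters. For example, consider
the independence model, $\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y$, for a 2 x 2 table. With constraints
$\lambda_2^X = \lambda_2^Y = 0$, it is

\[
\begin{pmatrix} \log\mu_{11} \\ \log\mu_{12} \\ \log\mu_{21} \\ \log\mu_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \lambda \\ \lambda_1^X \\ \lambda_1^Y \end{pmatrix}.
\]

For the model $\log \mu = X\beta$, we have $\log(\mu_i) = \sum_j x_{ij}\beta_j$ for all i. Extending (9.17), for
Poisson sampling the log likelihood is

\[
L(\mu) = \sum_i n_i \log \mu_i - \sum_i \mu_i
       = \sum_i n_i \Bigl(\sum_j x_{ij}\beta_j\Bigr) - \sum_i \exp\Bigl(\sum_j x_{ij}\beta_j\Bigr). \tag{9.20}
\]

The sufficient statistic for $\beta_j$ is its coefficient, $\sum_i n_i x_{ij}$. Since

\[
\frac{\partial}{\partial \beta_j}\Bigl[\exp\Bigl(\sum_j x_{ij}\beta_j\Bigr)\Bigr]
  = x_{ij}\exp\Bigl(\sum_j x_{ij}\beta_j\Bigr) = x_{ij}\mu_i,
\]

\[
\frac{\partial L(\mu)}{\partial \beta_j} = \sum_i n_i x_{ij} - \sum_i \mu_i x_{ij}, \qquad j = 1, 2, \ldots, p.
\]

The likelihood equations equate these derivatives to zero. They have the form

\[
X^T n = X^T \hat\mu. \tag{9.21}
\]
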
These equations equate the sufficient statistics to their expected values, a result obtained
with GLM theory in (4.32). For models considered so far, these sufficient statistics are
the marginal tables in the model symbol. To illustrate, consider model (XZ, YZ). Its log
likelihood is (9.18) with $\lambda^{XY} = \lambda^{XYZ} = 0$. The log-likelihood derivatives

\[
\frac{\partial L}{\partial \lambda_{ik}^{XZ}} = n_{i+k} - \mu_{i+k}
\quad\text{and}\quad
\frac{\partial L}{\partial \lambda_{jk}^{YZ}} = n_{+jk} - \mu_{+jk}
\]

yield the likelihood equations

\[
\hat\mu_{i+k} = n_{i+k} \quad \text{for all } i \text{ and } k, \tag{9.22}
\]

\[
\hat\mu_{+jk} = n_{+jk} \quad \text{for all } j \text{ and } k. \tag{9.23}
\]

Derivatives  with  respect  to  lower-order  terms  yield  equations  implied  by  these  (Exercise 
9.29).  For  model  (XZ,  YZ),  the  fitted  values  have  the  same  XZ  and  YZ  marginal  totals  as  the 
observed  data. 
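
This margin-matching property can be checked numerically for the student survey data, with model (AM, CM) playing the role of (XZ, YZ). The sketch below assumes the observed counts of Table 9.3 as they appear in the saturated-model column of Table 9.4, with cells keyed as (alcohol, cigarette, marijuana); the helper names are ours. Fitted values are rounded to two decimals, so margins agree only up to rounding.

```python
from collections import defaultdict

# Observed counts and (AM, CM) fitted values, keyed by (A, C, M).
observed = {
    ("yes", "yes", "yes"): 911, ("yes", "yes", "no"): 538,
    ("yes", "no", "yes"): 44, ("yes", "no", "no"): 456,
    ("no", "yes", "yes"): 3, ("no", "yes", "no"): 43,
    ("no", "no", "yes"): 2, ("no", "no", "no"): 279,
}
fitted = {
    ("yes", "yes", "yes"): 909.24, ("yes", "yes", "no"): 438.84,
    ("yes", "no", "yes"): 45.76, ("yes", "no", "no"): 555.16,
    ("no", "yes", "yes"): 4.76, ("no", "yes", "no"): 142.16,
    ("no", "no", "yes"): 0.24, ("no", "no", "no"): 179.84,
}


def margin(table, axes):
    """Sum a three-way table down to the variables at positions `axes`."""
    totals = defaultdict(float)
    for cell, count in table.items():
        totals[tuple(cell[a] for a in axes)] += count
    return dict(totals)


def max_gap(axes):
    """Largest absolute difference between observed and fitted margins."""
    obs, fit = margin(observed, axes), margin(fitted, axes)
    return max(abs(obs[key] - fit[key]) for key in obs)


gap_am = max_gap((0, 2))  # AM margin: in the model symbol, should match
gap_cm = max_gap((1, 2))  # CM margin: in the model symbol, should match
gap_ac = max_gap((0, 1))  # AC margin: not in the model symbol, need not match
print(gap_am, gap_cm, gap_ac)
```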

9.6.3  Unique  ML  Estimates  Match  Data  in  Sufficient  Marginal  Tables 

For  model  (XZ,  YZ),  from  (9.22),  (9.23),  and  Table  9.12,  the  minimal  sufficient  statistics 
are  the  ML  estimates  of  the  corresponding  marginal  distributions  of  expected  frequencies. 
Equation  (9.21)  gives  the  corresponding  result  for  any  loglinear  model.  Birch  (1963) 
showed  that  likelihood  equations  for  loglinear  models  match  minimal  sufficient  statistics 
to  their  expected  values.  Poisson  GLM  theory  implied  this  result  in  (4.32)  and  (4.51). 
Thus,  fitted  values  for  loglinear  models  are  smoothed  versions  of  the  cell  counts  that  match 
them  in  certain  marginal  distributions  but  have  associations  and  interactions  satisfying  the 
model-implied  patterns. 

Birch showed that a unique set of fitted values both satisfy the model and match the data
in the minimal sufficient statistics. Hence, if we find such a solution, it must be the ML
solution. To illustrate, the independence model for a two-way table

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y$$

has minimal sufficient statistics $\{n_{i+}\}$ and $\{n_{+j}\}$. The likelihood equations are

$$\hat{\mu}_{i+} = n_{i+}, \quad \hat{\mu}_{+j} = n_{+j}, \quad \text{for all } i \text{ and } j.$$

The fitted values $\{\hat{\mu}_{ij} = n_{i+}n_{+j}/n\}$ satisfy these equations and also satisfy the model.
Birch's result implies that they are the ML estimates.
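As a small numerical illustration of this argument, the following sketch checks that the independence fitted values match both sets of margins and have all local odds ratios equal to 1. The 2 x 3 table of counts is hypothetical and is used only for illustration.

```python
import numpy as np

# Hypothetical 2x3 table of counts, used only for illustration.
n = np.array([[10.0, 20.0, 30.0],
              [15.0, 5.0, 20.0]])

row = n.sum(axis=1, keepdims=True)   # n_{i+}
col = n.sum(axis=0, keepdims=True)   # n_{+j}
mu_hat = row * col / n.sum()         # n_{i+} n_{+j} / n

# Likelihood equations: fitted margins equal observed margins.
rows_match = np.allclose(mu_hat.sum(axis=1), n.sum(axis=1))
cols_match = np.allclose(mu_hat.sum(axis=0), n.sum(axis=0))

# Model check: every local odds ratio of the fitted table equals 1.
odds = (mu_hat[:-1, :-1] * mu_hat[1:, 1:]) / (mu_hat[:-1, 1:] * mu_hat[1:, :-1])
satisfies_model = np.allclose(odds, 1.0)
```

Because the fitted table satisfies the model and matches the sufficient statistics, Birch's uniqueness result identifies it as the ML fit without any iteration.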

9.6.4  Direct  Versus  Iterative  Calculation  of  Fitted  Values 

To illustrate how to solve likelihood equations, we continue the analysis of model (XZ, YZ).
From (9.9), the model satisfies

$$\pi_{ijk} = \frac{\pi_{i+k}\,\pi_{+jk}}{\pi_{++k}} \quad \text{for all } i, j, \text{ and } k.$$

For Poisson sampling, the related formula uses expected frequencies. Setting $\pi_{ijk} = \mu_{ijk}/n$,
this is $\{\mu_{ijk} = \mu_{i+k}\mu_{+jk}/\mu_{++k}\}$. The likelihood equations (9.22) and (9.23) specify that
ML estimates satisfy $\hat{\mu}_{i+k} = n_{i+k}$ and $\hat{\mu}_{+jk} = n_{+jk}$ and thus also $\hat{\mu}_{++k} = n_{++k}$. Since ML
estimates of functions of parameters are the same functions of the ML estimates of those
parameters,

$$\hat{\mu}_{ijk} = \frac{\hat{\mu}_{i+k}\,\hat{\mu}_{+jk}}{\hat{\mu}_{++k}} = \frac{n_{i+k}\,n_{+jk}}{n_{++k}}.$$

This solution satisfies the model and matches the data in the sufficient statistics. Thus, it is
the unique ML solution.

Similar reasoning produces $\{\hat{\mu}_{ijk}\}$ for all except one model in Table 9.12. Table 9.13
shows formulas. That table also expresses $\{\pi_{ijk}\}$ in terms of marginal probabilities. These
expressions and the likelihood equations determine the ML formulas, using the approach
just described.

Table 9.13 Fitted Values for Loglinear Models in Three-Way Tables

| Model^a | Probabilistic Form | Fitted Value |
|---|---|---|
| (X, Y, Z) | $\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}$ | $\hat{\mu}_{ijk} = n_{i++}n_{+j+}n_{++k}/n^2$ |
| (XY, Z) | $\pi_{ijk} = \pi_{ij+}\pi_{++k}$ | $\hat{\mu}_{ijk} = n_{ij+}n_{++k}/n$ |
| (XY, XZ) | $\pi_{ijk} = \pi_{ij+}\pi_{i+k}/\pi_{i++}$ | $\hat{\mu}_{ijk} = n_{ij+}n_{i+k}/n_{i++}$ |
| (XY, XZ, YZ) | $\pi_{ijk} = \psi_{ij}\phi_{jk}\omega_{ik}$ | Iterative methods (Section 9.7) |
| (XYZ) | No restriction | $\hat{\mu}_{ijk} = n_{ijk}$ |

^a Formulas for models not listed are obtained by symmetry; for example, for (XZ, Y), $\hat{\mu}_{ijk} = n_{i+k}n_{+j+}/n$.
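The direct formula for model (XZ, YZ) can be checked numerically. The sketch below assumes a three-way array of counts with axes ordered (i, j, k) and positive totals $n_{++k}$; the function name `fit_xz_yz` and the 2 x 2 x 2 counts are hypothetical.

```python
import numpy as np

def fit_xz_yz(n):
    """Direct ML fitted values for model (XZ, YZ); n has axes (i, j, k)."""
    n = np.asarray(n, dtype=float)
    n_i_k = n.sum(axis=1, keepdims=True)      # n_{i+k}, shape (I, 1, K)
    n_jk = n.sum(axis=0, keepdims=True)       # n_{+jk}, shape (1, J, K)
    n_k = n.sum(axis=(0, 1), keepdims=True)   # n_{++k}, shape (1, 1, K)
    return n_i_k * n_jk / n_k

# Hypothetical 2x2x2 counts, used only for illustration.
n = np.array([[[12, 7], [9, 14]],
              [[5, 11], [16, 6]]])
mu_hat = fit_xz_yz(n)

# Conditional XY odds ratio at each level k of Z; the model implies 1.
cond_or = mu_hat[0, 0, :] * mu_hat[1, 1, :] / (mu_hat[0, 1, :] * mu_hat[1, 0, :])
```

The fitted array reproduces the XZ and YZ margins of the data, and its conditional XY odds ratios equal 1, which is the conditional independence structure of the model.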

For models having explicit formulas for $\hat{\mu}_{ijk}$, the estimates are said to be direct. Many
loglinear models do not have direct estimates. ML estimation then requires iterative methods.
Of models in Tables 9.12 and 9.13, the only one not having direct estimates is
(XY, XZ, YZ). Although the two-way marginal tables are its minimal sufficient statistics,
it is not possible to express $\{\pi_{ijk}\}$ directly in terms of $\{\pi_{ij+}\}$, $\{\pi_{i+k}\}$, and $\{\pi_{+jk}\}$. Direct
estimates do not exist for unsaturated models containing all two-factor associations.

9.6.5  Decomposable  Models 

Unsaturated loglinear models that have direct ML estimates have interpretations in terms
of independence, conditional independence, or equiprobability. For those models, expected
frequencies decompose into products and ratios of expected marginal sufficient statistics.
Such models are called decomposable (Andersen 1974). For model (XZ, YZ) of XY conditional
independence, for example, from (9.9), $\mu_{ijk} = \mu_{i+k}\mu_{+jk}/\mu_{++k}$.

In  practice,  it  is  not  essential  to  know  which  models  have  direct  estimates.  Iterative 
methods  for  models  not  having  direct  estimates  also  apply  with  models  that  have  direct 
estimates.  Statistical  software  for  loglinear  models  uses  such  iterative  methods  for  all  cases. 

9.6.6  Chi-Squared  Goodness-of-Fit  Tests 

Model goodness-of-fit statistics compare sample cell counts to fitted counts. For Poisson
GLMs, in Section 4.5.2 we showed that for models with an intercept term, the deviance
equals the $G^2$ statistic. With a fixed number of cells, $G^2$ and $X^2$ have approximate
chi-squared null distributions when expected frequencies are large. The df equal the difference in
dimension between the alternative and null hypotheses. This equals the difference between
the number of parameters in the general case and when the model holds.

Table 9.14 Residual Degrees of Freedom for Loglinear Models for Three-Way Tables

| Model | Degrees of Freedom |
|---|---|
| (X, Y, Z) | $IJK - I - J - K + 2$ |
| (XY, Z) | $(K - 1)(IJ - 1)$ |
| (XZ, Y) | $(J - 1)(IK - 1)$ |
| (YZ, X) | $(I - 1)(JK - 1)$ |
| (XY, YZ) | $J(I - 1)(K - 1)$ |
| (XZ, YZ) | $K(I - 1)(J - 1)$ |
| (XY, XZ) | $I(J - 1)(K - 1)$ |
| (XY, XZ, YZ) | $(I - 1)(J - 1)(K - 1)$ |
| (XYZ) | 0 |

We illustrate with model (X, Y, Z), for multinomial sampling with probabilities $\{\pi_{ijk}\}$.
In the general case, the only constraint is $\sum_i \sum_j \sum_k \pi_{ijk} = 1$, so there are $IJK - 1$
parameters. For model (X, Y, Z), $\{\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}\}$ are determined by $I - 1$ of
$\{\pi_{i++}\}$ (since $\sum_i \pi_{i++} = 1$), $J - 1$ of $\{\pi_{+j+}\}$, and $K - 1$ of $\{\pi_{++k}\}$. Thus,

$$df = (IJK - 1) - [(I - 1) + (J - 1) + (K - 1)] = IJK - I - J - K + 2.$$

The same df formula applies for Poisson sampling. Then, the general case has $IJK$ $\{\mu_{ijk}\}$
parameters. The residual df equal the number of cells in the table minus the number of
parameters in the Poisson loglinear model for $\{\mu_{ijk}\}$. For instance, model (X, Y, Z) has residual
$df = IJK - [1 + (I - 1) + (J - 1) + (K - 1)]$, reflecting the intercept parameter $\lambda$ and
constraints such as $\lambda_I^X = \lambda_J^Y = \lambda_K^Z = 0$. This equals the number of linearly independent
parameters equated to zero in the saturated model to obtain the given model. Table 9.14
shows df formulas for testing three-way loglinear models.
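The parameter-counting rule can be sketched in a few lines. The function `residual_df` below is a hypothetical helper, not part of any library; it assumes a hierarchical model described by its generators, with each variable named by a single letter, and counts one intercept plus a product of (levels minus 1) for every term implied by the hierarchy.

```python
from itertools import combinations

def residual_df(levels, generators):
    """Residual df for a hierarchical loglinear model.

    levels: dict such as {"X": I, "Y": J, "Z": K}
    generators: highest-order terms, e.g. ["XZ", "YZ"] for model (XZ, YZ)
    """
    # Collect every term implied by the hierarchy (all nonempty subsets).
    terms = set()
    for g in generators:
        for r in range(1, len(g) + 1):
            terms.update(combinations(sorted(g), r))
    # Each term contributes prod(levels - 1) nonredundant parameters;
    # the leading 1 is the intercept.
    n_params = 1
    for term in terms:
        size = 1
        for v in term:
            size *= levels[v] - 1
        n_params += size
    n_cells = 1
    for count in levels.values():
        n_cells *= count
    return n_cells - n_params

levels = {"X": 2, "Y": 3, "Z": 4}          # hypothetical I, J, K
df_mutual = residual_df(levels, ["X", "Y", "Z"])
df_cond = residual_df(levels, ["XZ", "YZ"])
df_homog = residual_df(levels, ["XY", "XZ", "YZ"])
```

With I = 2, J = 3, K = 4, these counts agree with the corresponding formulas in the df table for three-way models.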


9.6.7  Covariance  Matrix  of  ML  Parameter  Estimators 

To present large-sample distributions of ML parameter estimators, we return to the general
expression $\log(\mu_i) = \sum_j x_{ij}\beta_j$, from which we obtained the log-likelihood derivatives

$$\frac{\partial L(\mu)}{\partial \beta_j} = \sum_i n_i x_{ij} - \sum_i \mu_i x_{ij}, \qquad j = 1, 2, \ldots, p.$$

The Hessian matrix of second partial derivatives has elements

$$\frac{\partial^2 L(\mu)}{\partial \beta_j\,\partial \beta_k} = -\sum_i x_{ij}\frac{\partial \mu_i}{\partial \beta_k} = -\sum_i x_{ij}\Big\{\frac{\partial}{\partial \beta_k}\Big[\exp\Big(\sum_h x_{ih}\beta_h\Big)\Big]\Big\} = -\sum_i x_{ij}x_{ik}\mu_i.$$

Like logistic regression models, loglinear models are GLMs using the canonical link
function; thus, this matrix does not depend on the observed data. The information matrix,
the negative of this matrix, is

$$\mathcal{J} = X^T \mathrm{Diag}(\mu) X,$$

where $\mathrm{Diag}(\mu)$ is a diagonal matrix with the elements of $\mu$ on the main diagonal.

For a fixed number of cells, as $n \to \infty$, the ML estimator $\hat{\beta}$ is asymptotically normal
with mean $\beta$ and covariance matrix $\mathcal{J}^{-1}$. Thus, for Poisson sampling, the asymptotic
covariance matrix is

$$\mathrm{cov}(\hat{\beta}) = [X^T \mathrm{Diag}(\mu) X]^{-1}. \tag{9.24}$$

Substituting ML fitted values and then taking square roots of diagonal elements yields
standard errors for $\hat{\beta}$. This also follows from the general expression (4.31) for GLMs, as
noted in Section 4.4.9.
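A minimal sketch of (9.24) follows, assuming a 2 x 2 independence model with dummy coding (columns: intercept, row effect, column effect) and hypothetical fitted values with row totals 30 and 70 and column totals 40 and 60. The function name `poisson_loglinear_cov` is an illustrative choice.

```python
import numpy as np

def poisson_loglinear_cov(X, mu):
    """Inverse information matrix [X^T Diag(mu) X]^{-1} of (9.24)."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    info = X.T @ (mu[:, None] * X)   # X^T Diag(mu) X without forming Diag
    return np.linalg.inv(info)

# Cells ordered (1,1), (1,2), (2,1), (2,2).
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
mu_hat = np.array([12.0, 18.0, 28.0, 42.0])   # n_{i+} n_{+j} / n

cov = poisson_loglinear_cov(X, mu_hat)
se = np.sqrt(np.diag(cov))                    # standard errors of beta-hat
```

In this example the variance of the row effect works out to 1/30 + 1/70, the familiar variance of a log ratio of two independent Poisson totals.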


9.6.8  Connection  Between  Multinomial  and  Poisson  Loglinear  Models 

Similar asymptotic results hold with multinomial sampling. When $\{Y_i, i = 1, \ldots, N\}$ are
independent Poisson random variables, the conditional distribution of $\{Y_i\}$ given $n = \sum_i Y_i$
is multinomial with parameters $\{\pi_i = \mu_i/(\sum_a \mu_a)\}$. Birch (1963) showed that ML estimates
of loglinear model parameters are the same for multinomial sampling as for independent
Poisson sampling. He showed that estimates are also the same for independent multinomial
sampling, as long as the model contains a term for the marginal distribution fixed by the
sampling design. To illustrate, suppose that at each combination of categories of X and Z,
an independent multinomial sample occurs on Y. Then, $\{n_{i+k}\}$ are fixed. The model must
contain $\lambda_{ik}^{XZ}$, so the fitted values satisfy $\{\hat{\mu}_{i+k} = n_{i+k}\}$.

That separate inferential theory is unnecessary for multinomial loglinear models follows
from the following argument. Express the Poisson loglinear model for $\{\mu_i\}$ as

$$\log \mu_i = \lambda + x_i\beta,$$

where $(1, x_i)$ is row $i$ of the model matrix $X$ and $(\lambda, \beta^T)^T$ is the model parameter vector.
The Poisson log likelihood is

$$L = L(\lambda, \beta) = \sum_i n_i \log \mu_i - \sum_i \mu_i = \sum_i n_i(\lambda + x_i\beta) - \sum_i \exp(\lambda + x_i\beta) = n\lambda + \sum_i n_i x_i\beta - \tau,$$

where $\tau = \sum_i \mu_i = \sum_i \exp(\lambda + x_i\beta)$. Since $\log \tau = \lambda + \log[\sum_i \exp(x_i\beta)]$, this log
likelihood has the form

$$L(\tau, \beta) = \Big\{\sum_i n_i x_i\beta - n \log\Big[\sum_i \exp(x_i\beta)\Big]\Big\} + (n \log \tau - \tau). \tag{9.25}$$

Now $\pi_i = \mu_i/(\sum_a \mu_a) = \exp(\lambda + x_i\beta)/[\sum_a \exp(\lambda + x_a\beta)]$, and $\exp(\lambda)$ cancels in the
numerator and denominator. Thus, the first term (in braces) on the right-hand side in (9.25)
is $\sum_i n_i \log \pi_i$, which is the multinomial log likelihood, conditional on the total cell count
$n$. Unconditionally, $n = \sum_i n_i$ has a Poisson distribution with expectation $\sum_i \mu_i = \tau$, so
the second term in (9.25) is the Poisson log likelihood for $n$. Since $\beta$ enters only in the first
term, the ML estimator $\hat{\beta}$ and its covariance matrix for the Poisson log likelihood $L(\lambda, \beta)$
are identical to those for the multinomial log likelihood. The Poisson loglinear model has
one more parameter (i.e., $\lambda$) than the multinomial loglinear model because of the random
sample size.
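The decomposition of the Poisson log likelihood into a multinomial part and a Poisson part for the total can be verified numerically. In the sketch below, the model matrix rows, the parameter values, and the counts are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 cells and 2 non-intercept predictors.
X = rng.normal(size=(5, 2))          # rows x_i (intercept column excluded)
beta = np.array([0.3, -0.2])
lam = 1.1                            # intercept lambda
n = np.array([3.0, 0.0, 7.0, 2.0, 5.0])

eta = X @ beta
mu = np.exp(lam + eta)
tau = mu.sum()

# Kernel of the Poisson log likelihood: sum_i n_i log mu_i - sum_i mu_i.
poisson_loglik = np.sum(n * np.log(mu)) - tau

# Multinomial part depends only on beta; the Poisson part for the
# total count n depends only on tau.
pi = mu / tau
multinomial_part = np.sum(n * np.log(pi))
total_part = n.sum() * np.log(tau) - tau
```

The two parts add up to the Poisson kernel, and the probabilities `pi` do not change when `lam` changes, which is why the intercept plays no role in the multinomial model.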

For a multinomial sample, we show in Section 16.4.1 that the estimated covariance
matrix of loglinear parameter estimators is

$$\widehat{\mathrm{cov}}(\hat{\beta}) = \{X^T[\mathrm{Diag}(\hat{\mu}) - \hat{\mu}\hat{\mu}^T/n]X\}^{-1}. \tag{9.26}$$

The intercept $\lambda$ from the Poisson model is not relevant, and $X$ for the multinomial model
deletes the column of $X$ pertaining to it in the Poisson model.

A similar argument applies with several independent multinomial samples. Each
log-likelihood term is a sum of components from different samples, but the Poisson log
likelihood again decomposes into two parts. One part is a Poisson log likelihood for the
independent sample sizes, and the other part is the sum of the independent multinomial
log likelihoods. Palmgren (1981) showed that conditional on observed marginal totals for
explanatory variables, the asymptotic covariances for estimators of parameters involving
the response are the same as for Poisson sampling. For a single multinomial sample,
Palmgren's result implies that (9.26) is identical to (9.24) with the row and column referring to $\lambda$
deleted. Birch (1963), Goodman (1970), and McCullagh and Nelder (1989, p. 211) gave
related results. Lang (1996c) gave an elegant discussion of connections between multinomial
and Poisson models. His results imply that the asymptotic variance of any linear contrast
of estimated log means within a covariate level is identical for the two models.
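The agreement between (9.26) and (9.24) with the intercept row and column deleted can be illustrated numerically. The sketch assumes a 2 x 2 independence model with dummy coding and hypothetical fitted values whose sum plays the role of $n$; all names are illustrative.

```python
import numpy as np

# Poisson model matrix: intercept, row effect, column effect.
X_poisson = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0],
                      [1.0, 1.0, 1.0]])
mu_hat = np.array([12.0, 18.0, 28.0, 42.0])   # hypothetical fitted values
n_total = mu_hat.sum()

# (9.24): Poisson covariance, then drop the intercept row and column.
cov_poisson = np.linalg.inv(X_poisson.T @ (mu_hat[:, None] * X_poisson))
cov_poisson_sub = cov_poisson[1:, 1:]

# (9.26): multinomial covariance, using X without the intercept column.
X_multi = X_poisson[:, 1:]
middle = np.diag(mu_hat) - np.outer(mu_hat, mu_hat) / n_total
cov_multi = np.linalg.inv(X_multi.T @ middle @ X_multi)
```

The equality follows from the partitioned-matrix inverse of the Poisson information matrix, given that the fitted values sum to $n$.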

9.6.9  Distribution  of  Probability  Estimators 

For multinomial sampling, the ML estimates of cell probabilities are $\hat{\pi} = \hat{\mu}/n$. We next
give the asymptotic $\mathrm{cov}(\hat{\pi})$. Lang (1996c) showed the asymptotic covariance matrix for $\hat{\mu}$
for Poisson sampling and its connection with $\mathrm{cov}(\hat{\pi})$.

The saturated model has $\hat{\pi} = p$, the sample proportions. Under multinomial sampling,
from (3.7) and (3.8), their covariance matrix is

$$\mathrm{cov}(p) = [\mathrm{Diag}(\pi) - \pi\pi^T]/n. \tag{9.27}$$

With $I$ independent multinomial samples on a response variable with $J$ categories, $\pi$ and $p$
consist of $I$ sets of proportions, each having $J - 1$ nonredundant elements. Then, $\mathrm{cov}(p)$
is a block diagonal matrix. Each of the independent samples has a $(J - 1) \times (J - 1)$ block
of form (9.27), and the matrix contains zeros off the main diagonal of blocks.

Now assume an unsaturated model. Using the delta method we show in Sections 16.2.2
and 16.4.1 that $\hat{\pi}$ has a large-sample normal distribution about $\pi$. The estimated covariance
matrix equals

$$\widehat{\mathrm{cov}}(\hat{\pi}) = \widehat{\mathrm{cov}}(p)X[X^T\widehat{\mathrm{cov}}(p)X]^{-1}X^T\widehat{\mathrm{cov}}(p).$$

For tables with many cells, it is not unusual to have a sample proportion of 0 in a cell.
In this case the ordinary standard error is 0, which is unappealing. An advantage of fitting
a model is that it typically has a positive fitted probability and standard error.

9.6.10  Proof  of  Uniqueness  of  ML  Estimates 

When all $\{n_i > 0\}$, the ML estimates exist and are unique. To show this, for simplicity we
use Poisson sampling. Suppose that the model is parameterized so that $X$ has full rank.
Birch (1963) showed that the likelihood equations are soluble, by noting that the kernel of
the Poisson log likelihood

$$L(\mu) = \sum_i (n_i \log \mu_i - \mu_i)$$

has individual terms converging to $-\infty$ as $\log(\mu_i) \to \pm\infty$; thus, the log likelihood is
bounded above and attains its maximum at finite values of the model parameters. It is
stationary at this maximum, since it has continuous first partial derivatives.

Birch showed that the likelihood equations have a unique solution, and the likelihood
is maximized at that point. He proved this by showing that the matrix of values
$\{-\partial^2 L/\partial \beta_j\,\partial \beta_k\}$ [i.e., the information matrix $X^T\mathrm{Diag}(\mu)X$] is nonsingular and nonnegative
definite, and hence positive definite. Nonsingularity follows from $X$ having full rank and
the diagonal matrix having positive elements $\{\mu_i\}$. Any quadratic form $c^T X^T \mathrm{Diag}(\mu) X c$
equals $\sum_i [\sqrt{\mu_i}\,(\sum_j x_{ij}c_j)]^2 \ge 0$, so the matrix is also nonnegative definite.

9.6.11  Pseudo  ML  for  Complex  Sampling  Designs 

Many surveys have sampling designs employing stratification and/or clustering and have
multiple stages. Skinner and Vallet (2010) noted that the meaning of the parameters in
the ordinary loglinear model $\log \mu = X\beta$ then depends on the sampling design. It is
more sensible to define the model in terms of population (rather than sample) expected
frequencies $\mu$.

Consider a sampling scheme in which each subject in the population who is classified
in cell $i$ is included in the sample with known probability $\pi_i$, and let $\pi$ denote the vector of
their values. Let $\mu_s$ denote the expected frequencies for the sample of size $n$. These relate
to the population expected frequencies by $\mu_{si} = \pi_i\mu_i$, so they satisfy

$$\log \mu_s = \log(\pi) + X\beta.$$

That is,

$$\log(\mu_{si}) - \log(\pi_i) = \sum_j x_{ij}\beta_j.$$

As noted in Section 4.3.5, the adjustment term, $-\log(\pi_i)$, to the log link is called an offset.
(In Section 9.7.4 we discuss further the use of model offsets.)

In most sampling designs, however, inclusion probabilities are not constant within cells
but vary among the individual subjects, and also the ordinary Poisson sampling model does
not apply. For each observation, a case weight (typically the reciprocal of the inclusion
probability) inflates or deflates the observation's influence according to features of that
design. Adding the case weights for subjects in a particular cell $i$ provides a total weighted
frequency for that cell. Denote this by $\hat{n}_i$. Skinner and Vallet then defined the pseudo ML
estimate of $\beta$ as the solution of the equations

$$X^T\hat{n} = X^T\hat{\mu},$$

where $\hat{\mu}_i = \exp(\sum_j x_{ij}\hat{\beta}_j)$. The estimate $\hat{\beta}$ can be found using a standard ML fitting routine such as
described in the next section, treating the weighted frequencies as the data.

Skinner and Vallet showed that the estimated covariance matrix (9.24) for ordinary
Poisson sampling should be replaced by a matrix estimated by

$$\widehat{\mathrm{cov}}(\hat{\beta}) = [X^T\mathrm{Diag}(\hat{\mu})X]^{-1}[X^T\hat{V}X][X^T\mathrm{Diag}(\hat{\mu})X]^{-1},$$

where $\hat{V}$ is an estimator of the covariance matrix of $\hat{n}$ that accounts for the complex
sampling. The matrix $\hat{V}$ can be obtained using survey software.² The vector $\hat{n}$ can be
scaled by a constant so the $\hat{n}_i$ sum to any total, such as the overall sample size $n$, in which
case $[X^T\mathrm{Diag}(\hat{\mu})X]^{-1}$ represents the covariance matrix when we ignore the complex
sampling design. Skinner and Vallet (2010) showed alternative methods and gave related
references.
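The sandwich form of this covariance matrix can be sketched as follows. The function name `pseudo_ml_cov`, the model matrix, and the fitted values are hypothetical, and in practice the matrix passed as `V_hat` would come from survey software rather than being constructed by hand as it is here.

```python
import numpy as np

def pseudo_ml_cov(X, mu_hat, V_hat):
    """Sandwich estimator A^{-1} (X^T V X) A^{-1}, A = X^T Diag(mu_hat) X."""
    X = np.asarray(X, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    A_inv = np.linalg.inv(X.T @ (mu_hat[:, None] * X))
    middle = X.T @ np.asarray(V_hat, dtype=float) @ X
    return A_inv @ middle @ A_inv

# Hypothetical 2x2 independence model (intercept, row, column effects).
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
mu_hat = np.array([12.0, 18.0, 28.0, 42.0])

naive = np.linalg.inv(X.T @ (mu_hat[:, None] * X))

# Sanity check: if V equals Diag(mu_hat), as under ordinary Poisson
# sampling, the sandwich collapses to the naive inverse information.
cov_simple = pseudo_ml_cov(X, mu_hat, np.diag(mu_hat))

# A design that doubles every variance doubles the covariance matrix.
cov_doubled = pseudo_ml_cov(X, mu_hat, 2.0 * np.diag(mu_hat))
```

The middle matrix is the only place where the sampling design enters, so the same fitted values can be reused with different design-based estimates of $\hat{V}$.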


9.7  LOGLINEAR  MODEL  FITTING:  ITERATIVE  METHODS  AND 
THEIR  APPLICATION 

When  a  loglinear  model  does  not  have  direct  estimates,  iterative  algorithms  such  as 
Newton-Raphson  can  solve  the  likelihood  equations.  In  this  section  we  also  present  a 
simpler  but  more  limited  method,  iterative  proportional  fitting. 

9.7.1  Newton-Raphson  Method 

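For the Newton-Raphson method (Section 4.6.1), we identify $L(\beta)$ as the log likelihood
for Poisson loglinear models. From (9.20), let

$$L(\beta) = \sum_i n_i\Big(\sum_h x_{ih}\beta_h\Big) - \sum_i \exp\Big(\sum_h x_{ih}\beta_h\Big).$$

Then

$$u_j = \frac{\partial L(\beta)}{\partial \beta_j} = \sum_i n_i x_{ij} - \sum_i \mu_i x_{ij},$$

$$h_{jk} = \frac{\partial^2 L(\beta)}{\partial \beta_j\,\partial \beta_k} = -\sum_i \mu_i x_{ij}x_{ik},$$

so that

$$u_j^{(t)} = \sum_i (n_i - \mu_i^{(t)})x_{ij} \quad \text{and} \quad h_{jk}^{(t)} = -\sum_i \mu_i^{(t)} x_{ij}x_{ik}.$$

The $t$th approximation $\mu^{(t)}$ for $\hat{\mu}$ derives from $\beta^{(t)}$ through $\mu^{(t)} = \exp(X\beta^{(t)})$. It generates
the next value $\beta^{(t+1)}$ using (4.45), which in this context is

$$\beta^{(t+1)} = \beta^{(t)} + [X^T\mathrm{Diag}(\mu^{(t)})X]^{-1}X^T(n - \mu^{(t)}).$$

This in turn produces $\mu^{(t+1)}$, and so on.

² Such as Survey Analysis in R.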

Alternatively, $\beta^{(t+1)}$ can be expressed as

$$\beta^{(t+1)} = -(H^{(t)})^{-1}r^{(t)}, \tag{9.28}$$

where $r_j^{(t)} = \sum_i \mu_i^{(t)}x_{ij}[\log \mu_i^{(t)} + (n_i - \mu_i^{(t)})/\mu_i^{(t)}]$. The expression in brackets is the
first-order Taylor series approximation of $\log n_i$ about $\mu_i^{(t)}$.

The iterative process begins with all $\mu_i^{(0)} = n_i$, or with an adjustment such as $\mu_i^{(0)} = n_i + \frac{1}{2}$
if any $n_i = 0$. Then (9.28) produces $\beta^{(1)}$, and for $t > 0$ the iterations proceed as
just described with $\{n_i\}$. For loglinear models $L(\beta)$ is concave, and $\mu^{(t)}$ and $\beta^{(t)}$ usually
converge rapidly to the ML estimates $\hat{\mu}$ and $\hat{\beta}$ as $t$ increases. The $H^{(t)}$ matrix converges
to $\hat{H} = -X^T\mathrm{Diag}(\hat{\mu})X$. By (9.24), the estimated large-sample covariance matrix of $\hat{\beta}$ is
$-\hat{H}^{-1}$, a by-product of the method.
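The iteration just described can be sketched with NumPy. This is a minimal illustration rather than a production routine: the function name, tolerance, and iteration cap are assumptions, and adding 1/2 to every count when any count is zero is one reading of the starting-value adjustment. Each pass applies (9.28) written as a weighted least-squares solve.

```python
import numpy as np

def newton_raphson_loglinear(X, n, tol=1e-10, max_iter=50):
    """Fit log(mu) = X beta to Poisson counts n by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    n = np.asarray(n, dtype=float)
    # Starting values mu^(0) = n, shifted by 1/2 if any count is zero.
    mu = n + 0.5 if np.any(n == 0) else n.copy()
    beta = None
    for _ in range(max_iter):
        # Bracketed term of (9.28) and information matrix at current mu.
        z = np.log(mu) + (n - mu) / mu
        info = X.T @ (mu[:, None] * X)
        beta_new = np.linalg.solve(info, X.T @ (mu * z))
        mu = np.exp(X @ beta_new)
        converged = beta is not None and np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    cov = np.linalg.inv(X.T @ (mu[:, None] * X))   # estimate of (9.24)
    return beta, mu, cov

# Hypothetical 2x2 table, cells ordered (1,1), (1,2), (2,1), (2,2),
# with the independence model in dummy coding.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
n = np.array([10, 20, 30, 40])
beta_hat, mu_hat, cov_hat = newton_raphson_loglinear(X, n)
```

For the independence model the converged fitted values should agree with the direct formula $n_{i+}n_{+j}/n$, which gives a convenient check on the iteration.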

As we discussed in Section 4.6.4 for GLMs, (9.28) has the iterative reweighted
least-squares form

$$\beta^{(t+1)} = (X^T V_t^{-1} X)^{-1}X^T V_t^{-1}z^{(t)}.$$

Here, $z^{(t)}$ has elements $z_i^{(t)} = \log \mu_i^{(t)} + (n_i - \mu_i^{(t)})/\mu_i^{(t)}$, and $V_t = [\mathrm{Diag}(\mu^{(t)})]^{-1}$. Thus,
$\beta^{(t+1)}$ is the weighted least-squares solution for a model

$$z^{(t)} = X\beta + \epsilon,$$

where $\{\epsilon_i\}$ are uncorrelated with variances $\{1/\mu_i^{(t)}\}$. With $\{\mu_i^{(0)} = n_i\}$, $\beta^{(1)}$ is the weighted
least-squares estimate for model $\log(n) = X\beta + \epsilon$.
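The equivalence between the Newton-Raphson update and the weighted least-squares form can be confirmed for a single step. The model matrix, counts, and current iterate below are hypothetical.

```python
import numpy as np

X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
n = np.array([10.0, 20.0, 30.0, 40.0])
beta_t = np.array([2.0, 0.5, 0.3])        # arbitrary current iterate
mu_t = np.exp(X @ beta_t)

info = X.T @ (mu_t[:, None] * X)          # X^T Diag(mu) X

# Newton-Raphson step: beta + [X^T Diag(mu) X]^{-1} X^T (n - mu).
beta_nr = beta_t + np.linalg.solve(info, X.T @ (n - mu_t))

# Weighted least-squares step with V_t^{-1} = Diag(mu_t).
z = np.log(mu_t) + (n - mu_t) / mu_t
beta_wls = np.linalg.solve(info, X.T @ (mu_t * z))
```

The two updates coincide because $\log \mu^{(t)} = X\beta^{(t)}$, so the working response $z^{(t)}$ equals $X\beta^{(t)}$ plus the scaled residual.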

9.7.2  Iterative  Proportional  Fitting 

The iterative proportional fitting (IPF) algorithm is a simple method for calculating $\{\hat{\mu}_i\}$ for
loglinear models. Introduced by Deming and Stephan (1940) and later extended by Bishop
(1969) and by Fienberg (1970a) for loglinear modeling, it has the following steps:

1.  Start with $\{\mu_i^{(0)}\}$ satisfying a model no more complex than the one being fitted.
For instance, $\{\mu_i^{(0)} = 1.0\}$ are trivially adequate.

2.  By multiplying by appropriate factors, adjust $\{\mu_i^{(0)}\}$ successively to match each
marginal table in the set of minimal sufficient statistics.

3.  Continue until the maximum difference between the sufficient statistics and their
fitted values is sufficiently close to zero.

We illustrate using model (XY, XZ, YZ). Its minimal sufficient statistics are $\{n_{ij+}\}$, $\{n_{i+k}\}$,
and $\{n_{+jk}\}$. Initial estimates must satisfy the model. The first cycle of the IPF algorithm has
three steps:

$$\mu_{ijk}^{(1)} = \mu_{ijk}^{(0)}\frac{n_{ij+}}{\mu_{ij+}^{(0)}}, \qquad \mu_{ijk}^{(2)} = \mu_{ijk}^{(1)}\frac{n_{i+k}}{\mu_{i+k}^{(1)}}, \qquad \mu_{ijk}^{(3)} = \mu_{ijk}^{(2)}\frac{n_{+jk}}{\mu_{+jk}^{(2)}}.$$

Summing both sides of the first expression over $k$ shows that $\mu_{ij+}^{(1)} = n_{ij+}$ for all $i$ and $j$.
After step 1, observed and fitted values match in the XY marginal table. After step 2, all
$\mu_{i+k}^{(2)} = n_{i+k}$, but the XY marginal tables no longer match. After step 3, all $\mu_{+jk}^{(3)} = n_{+jk}$, but
the XY and XZ marginal tables no longer match. A new cycle begins by again matching the
XY marginal tables, using $\mu_{ijk}^{(4)} = \mu_{ijk}^{(3)}(n_{ij+}/\mu_{ij+}^{(3)})$, and so on.

At each step, the updated estimates continue to satisfy the model. For instance, step 1
uses the same adjustment factor $(n_{ij+}/\mu_{ij+}^{(0)})$ at different levels $k$ of $Z$. Thus, XY odds ratios
from different levels of $Z$ have ratio equal to 1, and the homogeneous association pattern
continues at each step.

As the cycles progress, the $G^2$ statistic comparing cell counts to the updated fit is
monotone decreasing, and the process must converge (Fienberg 1970a, Haberman 1974a).
The IPF algorithm produces ML estimates because it generates a sequence of fitted values
converging to a solution that both satisfies the model and matches the sufficient statistics.
From Section 9.6.10, only one such solution exists, and it is ML.
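The three-step cycle for model (XY, XZ, YZ) can be sketched as follows. The function name, tolerance, and cycle cap are assumptions; the counts are those of Table 9.16 in Exercise 9.1, arranged with axes (gender, information opinion, health opinion), and all margins are assumed positive.

```python
import numpy as np

def ipf_no_three_factor(n, tol=1e-8, max_cycles=500):
    """IPF for model (XY, XZ, YZ); n has axes (i, j, k)."""
    n = np.asarray(n, dtype=float)
    mu = np.ones_like(n)                     # mu^(0) = 1 satisfies the model
    for _ in range(max_cycles):
        # Step 1: match the XY margin; step 2: XZ; step 3: YZ.
        mu *= n.sum(axis=2, keepdims=True) / mu.sum(axis=2, keepdims=True)
        mu *= n.sum(axis=1, keepdims=True) / mu.sum(axis=1, keepdims=True)
        mu *= n.sum(axis=0, keepdims=True) / mu.sum(axis=0, keepdims=True)
        # YZ now matches exactly; check the other two margins.
        gap = max(np.abs(mu.sum(axis=2) - n.sum(axis=2)).max(),
                  np.abs(mu.sum(axis=1) - n.sum(axis=1)).max())
        if gap < tol:
            break
    return mu

# Counts from Table 9.16: [gender][information][health].
n = np.array([[[76, 160], [6, 25]],
              [[114, 181], [11, 48]]])
mu_hat = ipf_no_three_factor(n)

# Conditional odds ratio between the first two variables at each
# level of the third; homogeneous association makes these equal.
cond_or = mu_hat[0, 0, :] * mu_hat[1, 1, :] / (mu_hat[0, 1, :] * mu_hat[1, 0, :])
```

Because every adjustment multiplies by a factor that is constant over one index, the homogeneous association pattern holds at every step, not only at convergence.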

The IPF method works even for models having direct estimates. Then, IPF normally
yields ML estimates within one cycle (Haberman 1974a, p. 197). We illustrate with the
model of independence. The minimal sufficient statistics are $\{n_{i+}\}$ and $\{n_{+j}\}$. With $\{\mu_{ij}^{(0)} = 1.0\}$,
the first cycle gives

$$\mu_{ij}^{(1)} = \mu_{ij}^{(0)}\frac{n_{i+}}{\mu_{i+}^{(0)}} = \frac{n_{i+}}{J},$$

$$\mu_{ij}^{(2)} = \mu_{ij}^{(1)}\frac{n_{+j}}{\mu_{+j}^{(1)}} = \frac{n_{i+}n_{+j}}{n}.$$

The IPF algorithm then gives $\mu_{ij}^{(t)} = n_{i+}n_{+j}/n$ for all $t > 2$.
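The one-cycle convergence for the independence model can be traced step by step. The 2 x 3 table of counts is hypothetical.

```python
import numpy as np

# Hypothetical 2x3 table of counts.
n = np.array([[10.0, 20.0, 30.0],
              [15.0, 5.0, 20.0]])
rows = n.sum(axis=1, keepdims=True)   # n_{i+}
cols = n.sum(axis=0, keepdims=True)   # n_{+j}

mu0 = np.ones_like(n)
mu1 = mu0 * rows / mu0.sum(axis=1, keepdims=True)   # match row margins
mu2 = mu1 * cols / mu1.sum(axis=0, keepdims=True)   # match column margins

# A second cycle leaves the fit unchanged.
mu3 = mu2 * rows / mu2.sum(axis=1, keepdims=True)
mu4 = mu3 * cols / mu3.sum(axis=0, keepdims=True)

direct = rows * cols / n.sum()        # n_{i+} n_{+j} / n
```

After the first step every entry in row $i$ equals $n_{i+}/J$, and after the second step the fit already equals the direct estimate.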

9.7.3  Comparison  of  IPF  and  Newton-Raphson  Iterative  Methods 

The  IPF  algorithm  is  simple  and  easy  to  implement.  It  converges  to  the  ML  fit  even  when 
the  likelihood  is  poorly  behaved,  for  instance,  with  zero  fitted  counts  and  estimates  on 
the  boundary  of  the  parameter  space.  The  Newton-Raphson  method  is  more  complex, 
requiring  solving  a  system  of  equations  at  each  step.  Newton-Raphson  is  sometimes  not 
feasible  when  the  model  is  of  high  dimensionality — for  instance,  when  the  contingency 
table  and  parameter  vector  are  huge. 

However, IPF has disadvantages. It is applicable primarily to models for which
likelihood equations equate observed and fitted counts in marginal tables. By contrast,
Newton-Raphson is a general-purpose method that can solve more complex likelihood
equations. IPF sometimes converges slowly compared with Newton-Raphson. Unlike
Newton-Raphson, IPF does not produce the model parameter estimates and their
estimated covariance matrix as a by-product. Fitted values that IPF produces can generate this
information. Model parameter estimates are contrasts of $\{\log \hat{\mu}_i\}$ (see Exercises 9.16 and
9.17), and substituting fitted values into (9.24) yields $\widehat{\mathrm{cov}}(\hat{\beta})$.

Because  the  Newton-Raphson  algorithm  applies  to  a  wide  variety  of  models  and  also 
yields  standard  errors,  it  is  the  fitting  routine  used  by  most  software  for  loglinear  models. 
IPF  is  primarily  of  historical  interest.  However,  for  some  applications  the  analysis  is  more 
transparent  using  IPF.  The  next  example  illustrates. 


9.7.4  Raking  a  Table:  Contingency  Table  Standardization 

Table  9.15  relates  political  party  affiliation  and  political  ideology  for  the  2008  General 
Social  Survey.  To  make  the  pattern  of  association  clearer,  we  standardized  the  table  so 
that  all  row  and  column  marginal  totals  equal  100  while  maintaining  the  sample  odds  ratio 
structure. 

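The IPF routine to standardize with margins of 100 is

$$\mu_{ij}^{(0)} = n_{ij}$$

and then for $t = 1, 3, 5, \ldots$,

$$\mu_{ij}^{(t)} = \mu_{ij}^{(t-1)}\frac{100}{\mu_{i+}^{(t-1)}}, \qquad \mu_{ij}^{(t+1)} = \mu_{ij}^{(t)}\frac{100}{\mu_{+j}^{(t)}}.$$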

At  the  end  of  each  odd-numbered  step,  all  row  totals  equal  100.  At  the  end  of  each  even- 
numbered  step,  all  column  totals  equal  100.  Odds  ratios  do  not  change  at  each  odd  (even) 
step,  since  all  counts  in  a  given  row  (column)  multiply  by  the  same  constant. 

The  IPF  algorithm  converges  to  the  entries  in  parentheses  in  Table  9.15.  The  association  is 
clearer  in  this  standardized  table.  A  ridge  appears  down  the  main  diagonal,  with  Republicans 
having  more  conservative  political  ideology.  The  other  counts  fall  away  smoothly  on  both 
sides. 
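The raking routine above can be sketched directly, using the observed counts of Table 9.15. The function name, tolerance, and iteration cap are assumptions, and the sketch requires a table whose target row totals and target column totals have the same sum (true here for a 3 x 3 table with all margins equal to 100).

```python
import numpy as np

def rake(table, target=100.0, tol=1e-9, max_iter=1000):
    """Alternately rescale rows and columns until all margins equal target."""
    mu = np.asarray(table, dtype=float).copy()
    for _ in range(max_iter):
        mu *= target / mu.sum(axis=1, keepdims=True)   # odd step: rows
        mu *= target / mu.sum(axis=0, keepdims=True)   # even step: columns
        # Columns now match exactly; stop once the rows also match.
        if np.abs(mu.sum(axis=1) - target).max() < tol:
            break
    return mu

# Observed counts: rows Democrat, Independent, Republican;
# columns Liberal, Moderate, Conservative.
counts = np.array([[306, 279, 116],
                   [185, 312, 194],
                   [26, 134, 338]])
raked = rake(counts)

# Odds ratios are unchanged by row and column rescaling.
or_raw = counts[0, 0] * counts[1, 1] / (counts[0, 1] * counts[1, 0])
or_raked = raked[0, 0] * raked[1, 1] / (raked[0, 1] * raked[1, 0])
```

Rounding `raked` to one decimal place gives values that can be compared with the standardized entries shown in parentheses in Table 9.15.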

Table 9.15 Marginal Standardization of Political Ideology by Political Party Affiliation

| Party Affiliation | Liberal | Moderate | Conservative | Total |
|---|---|---|---|---|
| Democrat | 306 (55.0) | 279 (32.5) | 116 (12.5) | (100) |
| Independent | 185 (36.7) | 312 (40.1) | 194 (23.2) | (100) |
| Republican | 26 (8.2) | 134 (27.5) | 338 (64.3) | (100) |
| Total | (100) | (100) | (100) | |

Table standardization is a useful method for comparing tables having different marginal
structures. Mosteller (1968) compared intergenerational occupational mobility tables from
Britain and Denmark. Yule (1912) compared three hospitals on vaccination and recovery
for smallpox patients. A modern application is adjusting sample data to match marginal
distributions specified by census results.

The process of table standardization is called raking the table. Imrey et al. (1981) and
Little and Wu (1991) derived the asymptotic covariance matrix for raked sample
proportions. For sample counts $\{n_{ij}\}$ with $\{\mu_{ij} = E(n_{ij})\}$, let $\{E_{ij}\}$ denote expected frequencies for
the standardized table and $\{\hat{E}_{ij}\}$ fitted values in the standardized table. The standardization
process corresponds to fitting the model

$$\log(E_{ij}/\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y.$$

That is, maintaining the odds ratios means that the two-way tables of $\{E_{ij}/\mu_{ij}\}$ and of
$\{\hat{E}_{ij}/n_{ij}\}$ satisfy independence.

The fitted values $\{\hat{E}_{ij}\}$ in the standardized table satisfy

$$\log \hat{E}_{ij} - \log n_{ij} = \lambda + \lambda_i^X + \lambda_j^Y,$$

with offset $-\log n_{ij}$. Standard GLM software can fit models having offsets. To rake a table,
one enters as sample data pseudo-values that satisfy independence and have the desired
margins, taking $\log n_{ij}$ as an offset.


NOTES 

Section  9.2:  Loglinear  Models  for  Independence  and  Interaction  in  Three-Way  Tables 

9.1  Early loglinear: Roy and Mitra (1956) discussed types of independence for three-way tables
and their large-sample tests. Birch's (1963) article on ML estimation for loglinear models was
part of substantial research on loglinear models in the 1960s, much due to L. A. Goodman (see
Section 17.4). Haberman (1974a) presented an influential theoretical study of loglinear models.


Section  9.3:  Inference  for  Loglinear  Models 

9.2  Decomposable models: Decomposable models were studied by Andersen (1974) and Sundberg
(1975), building on earlier results by Goodman (1970, 1971b) and Haberman (1974a, Chap. 5),
who proved conditions under which loglinear models have direct estimates. For later related
work, see Darroch et al. (1980), Dobra (2003), Dobra and Fienberg (2000), Lauritzen (1996),
and Whittaker (1990, Sec. 12.4). Meeden et al. (1998) showed that with squared error loss,
the ML estimator of cell probabilities for any decomposable loglinear model is admissible.
Baglivo et al. (1992) and Forster et al. (1996) discussed small-sample exact inference. See also
Section 10.1.2.

9.3  Measurement  error:  For  methods  that  allow  for  misclassification  error,  see  Espeland  and  Hui 
(1987),  Kuha  and  Skinner  (1997),  Kuha  et  al.  (2005)  and  references  therein,  and  Palmgren  and 
Ekholm  (1987).  For  the  related  issue  of  measurement  error,  see  Buonaccorsi  (2010,  Chap.  2, 
3,  7)  and  Cox  and  Snell  (1989,  Sec.  3.4). 


Section  9.7:  Loglinear  Model  Fitting:  Iterative  Methods  and  Their  Application 

9.4  IPF: Darroch (1962) used IPF to obtain ML estimates in contingency tables. Bishop et al.
(1975), Fienberg (1970a), and Speed (2005) presented other applications of IPF. Darroch and
Ratcliff (1972) generalized IPF for models in which sufficient statistics are more complex than
marginal tables.

9.5  Raking: For further discussion of table raking, see Bishop et al. (1975, pp. 99-102), Deville
et al. (1993), Haberman (1979, Chap. 9), Hoem (1987), Imrey et al. (1981), and Little and Wu
(1991).


EXERCISES 

Applications 

9.1  A General Social Survey asked: “Do you support or oppose the following measures
to deal with AIDS? (1) Have the government pay all of the health care costs of
AIDS patients; (2) Develop a government information program to promote safe sex
practices, such as the use of condoms.” Table 9.16 summarizes opinions about health
care costs (H) and the information program (I), classified also by the respondent's
gender (G). Fit loglinear models (GH, GI), (GH, HI), (GI, HI), and (GH, GI, HI).
Show that models that lack the HI term fit poorly. Interpret results for the model
(GH, GI, HI).

Table 9.16 Data for Exercise 9.1 on Measures for Dealing with AIDS

| Gender | Information Opinion | Health Opinion: Support | Health Opinion: Oppose |
|---|---|---|---|
| Male | Support | 76 | 160 |
| Male | Oppose | 6 | 25 |
| Female | Support | 114 | 181 |
| Female | Oppose | 11 | 48 |

9.2 Table 9.17 shows the result of cross-classifying a sample of people from the MBTI
Step II National Sample, collected and compiled by CPP, Inc., on the four scales of the
Myers-Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive
(S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P). The 16 cells in this table
correspond to the personality types. Fit the loglinear model of homogeneous association.
Based on the fit, show that the estimated conditional association is strongest
between the S/N and J/P scales and that there is not strong evidence of conditional
association between the E/I and T/F scales or between the E/I and J/P scales.


Table 9.17 Data on Four Scales of Myers-Briggs Personality Test

Extroversion/Introversion             E                       I
Sensing/iNtuitive                S          N            S          N
Thinking/Feeling               T     F    T     F      T     F    T     F
Judging/Perceiving
  J                           77   106   23    31    140   138   13    31
  P                           42    79   18    80     52   106   35    79

Source:  Reproduced  with  special  permission  of  CPP,  Inc.,  Mountain  View,  CA  94043. 
Table 9.18 Software Output (Based on SAS) for Fitting a Loglinear Model to Table 9.17

Criteria For Assessing Goodness Of Fit
Criterion              DF      Value
Deviance                7    12.3687
Pearson Chi-Square      7    12.1996

Analysis Of Parameter Estimates
                                   Standard    LR 95% Confidence    Wald Chi-
Parameter        DF    Estimate    Error       Limits               Square
EI*SN   e  n      1      0.3219    0.1360       0.0553    0.5886       5.60
SN*TF   n  f      1      0.4237    0.1520       0.1278    0.7242       7.77
SN*JP   n  j      1     -1.2202    0.1451      -1.5075   -0.9382      70.69
TF*JP   f  j      1     -0.5585    0.1350      -0.8242   -0.2948      17.12

9.3 Refer to the previous exercise. Table 9.18 shows the fit of the model that assumes
conditional independence between E/I and T/F and between E/I and J/P but has the
other pairwise associations.

a. Compare this to the fit of the model containing all the pairwise associations,
which has deviance 10.16 with df = 5. What do you conclude?

b. Show how to use the limits reported to construct a 95% profile likelihood
confidence interval for the conditional odds ratio between the S/N and J/P scales.
Interpret.

c. SAS (PROC GENMOD) reports maximized log-likelihood values of 3475.19
for the mutual independence model, 3538.05 for the homogeneous association
model, and 3539.58 for the model containing all the three-factor interaction terms.
Write the loglinear model for each case, and show that the numbers of model
parameters are 5, 11, and 15, so residual df = 11, 5, and 1.

d.  According  to  AIC,  which  model  in  (c)  seems  best?  Why? 


9.4 Refer to Section 9.3.2. Explain why software for which parameters sum to zero
across levels of each index reports $\hat\lambda_{11}^{AC} = \hat\lambda_{22}^{AC} = 0.514$ and
$\hat\lambda_{12}^{AC} = \hat\lambda_{21}^{AC} = -0.514$, with SE = 0.044 for each term.


9.5  Subjects  in  a  GSS  were  asked  their  opinions  about  government  spending  on  the 
environment (E), health (H), assistance to big cities (C), and law enforcement (L).
The data are shown at the text website, with outcome categories 1 = too little,
2 = about right, 3 = too much. For the homogeneous association model, Table 9.19
shows  some  results,  including  the  two-factor  estimates  for  the  EH  association  for 
coding  by  which  estimates  at  category  3  of  each  variable  equal  0. 

a.  Test  the  model  goodness  of  fit,  and  interpret. 

b.  Report  the  estimated  EH  conditional  odds  ratio  for  the  (i)  too  much  and  too  little 
categories,  (ii)  too  much  and  about  right  categories,  and  (iii)  about  right  and  too 
little  categories. 

c. Table 9.20 reports $\{\hat\lambda_{ij}^{EH}\}$ when parameters sum to zero within rows and within
columns,  and  when  parameters  are  zero  in  the  first  row  and  first  column.  Show 
Table 9.19 Software Output (Based on SAS) for Fitting Model For Exercise 9.5

Criteria For Assessing Goodness Of Fit
Criterion              DF        Value    Value/DF
Deviance               48      31.6695      0.6598
Pearson Chi-Square     48      26.5224      0.5526
Log Likelihood               1284.9404

                                   Standard    Wald 95%              Chi-
Parameter        DF    Estimate    Error       Confidence Limits     Square
e*h   1  1        1      2.1425    0.5566       1.0515    3.2335      14.81
e*h   1  2        1      1.4221    0.6034       0.2394    2.6049       5.55
e*h   2  1        1      0.7294    0.5667      -0.3813    1.8402       1.66
e*h   2  2        1      0.3183    0.6211      -0.8991    1.5356       0.26

Table 9.20 Parameter Estimates for Model in Exercise 9.5

        Sum to Zero Constraints        Zero for First Level
                   H                            H
E          1        2        3          1        2        3
1      0.509    0.166   -0.676          0        0        0
2     -0.065   -0.099    0.163          0    0.309    1.413
3     -0.445   -0.068    0.513          0    0.720    2.142

how  these  yield  the  estimated  EH  conditional  odds  ratio  for  the  too  much  and  too 
little  categories.  Construct  a  confidence  interval  for  that  odds  ratio.  Interpret. 

9.6 For 2010 General Social Survey data cross-classifying opinions on A = abortion
should be legal for any reason (1 = yes, 0 = no), E = willingness to pay higher taxes
to help the environment (1 = yes, 0 = no), and P = political party identification
(1 = Democratic, 0 = Republican), the 8 cell counts for (A, E, P) values were 50
for (1,1,1), 5 for (1,1,0), 28 for (1,0,1), 30 for (1,0,0), 32 for (0,1,1), 27 for (0,1,0),
50  for  (0,0,1),  61  for  (0,0,0).  Analyze  these  data  using  loglinear  models.  (We  do 
not  list  the  counts  here  for  Independents  or  for  those  who  were  neutral  about  higher 
taxes  for  the  environment.) 

9.7  Table  9.21  refers  to  automobile  accident  records  in  Florida. 


Table 9.21 Data for Exercise 9.7

Safety Equipment    Whether            Injury
in Use              Ejected     Nonfatal     Fatal
Seat belt           Yes            1,105        14
                    No           411,111       483
None                Yes            4,624       497
                    No           157,342     1,008

Source: Florida Department of Highway Safety and Motor Vehicles.

a. Find a loglinear model that describes the data well. Interpret associations.

b.  Treating  whether  killed  as  the  response,  fit  an  equivalent  logistic  model.  Interpret 
the  effects. 

c.  Since  n  is  large,  goodness-of-fit  statistics  are  large  unless  the  model  fits  very  well. 
Calculate  the  dissimilarity  index  for  the  model  in  part  (a),  and  interpret. 

9.8  Refer  to  the  loglinear  models  for  the  auto  accident  data  of  Table  9.8. 

a. Explain why the fitted odds ratios in Table 9.10 for model (GI, GL, GS, IL, IS,
LS) suggest that the most likely accident case for injury is females not wearing
seat belts in rural locations.

b. Fit model (GLS, GI, IL, IS). Using model parameter estimates, show that the
fitted IS conditional odds ratio equals 0.44, and show that for each injury level,
the estimated conditional LS odds ratio is 1.17 for (G = female) and 1.03 for
(G = male).

c.  Consider  the  following  two-stage  model:  The  first  stage  is  a  logistic  model  with  S 
as  the  response  for  the  three-way  GLS  table.  The  second  stage  is  a  logistic  model 
with  these  three  variables  as  predictors  for  I  in  the  four-way  table.  Explain  why 
this  composite  model  is  sensible,  fit  the  models,  and  interpret  results. 

9.9  Refer  to  the  logistic  model  in  Exercise  5.17  on  the  death  penalty. 

a.  Give  the  symbol  for  the  loglinear  model  that  is  equivalent  to  this  logistic  model. 

b. Which logistic model corresponds to loglinear model (YD, YV, DVF)?

c. State the equivalent loglinear and logit models for which (i) Y is jointly independent
of D, V, and F; (ii) there are main effects of F on Y, but Y is conditionally
independent of D and V, given F; and (iii) there is interaction between D and V
in their effects on Y, and F has main effects.

9.10  For  a  multiway  contingency  table,  when  is  a  logistic  model  more  appropriate  than  a 
loglinear  model?  When  is  a  loglinear  model  more  appropriate? 

9.11  Using  software,  conduct  the  analyses  described  in  this  chapter  for  the  student  survey 
data  (Table  9.3). 

9.12 Using table raking, standardize Table 11.7. Describe the migration patterns.

9.13 The book’s website (www.stat.ufl.edu/~aa/cda/cda.html) has a 2 x 3 x 2 x 2
table relating responses on frequency of attending religious services, political
views,  opinion  on  making  birth  control  available  to  teenagers,  and  opinion  about  a 
man  and  woman  having  sexual  relations  before  marriage.  Analyze  these  data  using 
loglinear  models.  Interpret  results. 


Theory  and  Methods 

9.14 Suppose that $\{\mu_{ij} = n\pi_{ij}\}$ satisfy the independence model (9.1).

a. Show that $\lambda_a^Y - \lambda_b^Y = \log(\pi_{+a}/\pi_{+b})$.

b. Show that {all $\lambda_j^Y = 0$} is equivalent to $\pi_{+j} = 1/J$ for all j.

9.15 Refer to the independence model, $\mu_{ij} = \mu \alpha_i \beta_j$. For the corresponding loglinear
model (9.1):

a. Show that you can constrain $\sum_i \lambda_i^X = \sum_j \lambda_j^Y = 0$ by setting

    $\lambda_i^X = \log \alpha_i - (\sum_h \log \alpha_h)/I$,
    $\lambda_j^Y = \log \beta_j - (\sum_h \log \beta_h)/J$,
    $\lambda = \log \mu + (\sum_h \log \alpha_h)/I + (\sum_h \log \beta_h)/J$.

b. Show that you can constrain $\lambda_1^X = \lambda_1^Y = 0$ by defining $\lambda_i^X = \log \alpha_i - \log \alpha_1$ and
$\lambda_j^Y = \log \beta_j - \log \beta_1$. Then, what does $\lambda$ equal?


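9.16 For an I x J table, let $\eta_{ij} = \log \mu_{ij}$, and let a dot subscript denote the mean for that
index (e.g., $\eta_{i.} = \sum_j \eta_{ij}/J$). Then, let $\lambda = \eta_{..}$, $\lambda_i^X = \eta_{i.} - \eta_{..}$, $\lambda_j^Y = \eta_{.j} - \eta_{..}$, and
$\lambda_{ij}^{XY} = \eta_{ij} - \eta_{i.} - \eta_{.j} + \eta_{..}$.

a. Show that $\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$. Hence, any set of positive $\{\mu_{ij}\}$ satisfies
the saturated model.

b. Show that $\sum_i \lambda_i^X = \sum_j \lambda_j^Y = \sum_i \lambda_{ij}^{XY} = \sum_j \lambda_{ij}^{XY} = 0$.

c. For 2 x 2 tables, show that $\log \theta = 4\lambda_{11}^{XY}$.

d. For 2 x J tables, show that $\lambda_{11}^{XY} = (\sum_j \log \alpha_j)/2J$, where
$\alpha_j = \mu_{11}\mu_{2j}/\mu_{21}\mu_{1j}$, j = 2, ..., J.

e. Alternative constraints have other odds ratio formulas. Let $\lambda = \eta_{11}$, $\lambda_i^X = \eta_{i1} - \eta_{11}$,
$\lambda_j^Y = \eta_{1j} - \eta_{11}$, and $\lambda_{ij}^{XY} = \eta_{ij} - \eta_{i1} - \eta_{1j} + \eta_{11}$. Then, show that
the saturated model holds with $\lambda_1^X = \lambda_1^Y = \lambda_{1j}^{XY} = \lambda_{i1}^{XY} = 0$ for all i and j, and
$\lambda_{ij}^{XY} = \log(\mu_{11}\mu_{ij}/\mu_{1j}\mu_{i1})$.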

9.17 Suppose that all $\mu_{ijk} > 0$. Let $\eta_{ijk} = \log \mu_{ijk}$, and consider model parameters with
zero-sum constraints.

a. For the general loglinear model (9.12), define parameters in the fashion of
Exercise 9.16 (e.g., $\lambda_{ij}^{XY} = \eta_{ij.} - \eta_{i..} - \eta_{.j.} + \eta_{...}$).

b. For model (XY, XZ, YZ) with a 2 x 2 x 2 table, show that $\lambda_{11}^{XY} = \frac{1}{4} \log \theta_{11(k)}$.

c. For (XYZ) with a 2 x 2 x 2 table, show that

    $\lambda_{111}^{XYZ} = \frac{1}{8} \log[\theta_{11(1)}/\theta_{11(2)}]$.


9.18  Two  balanced  coins  are  flipped,  independently.  Let  X  =  whether  the  first  flip 
resulted  in  a  head  (yes,  no),  Y  =  whether  the  second  flip  resulted  in  a  head,  and 
Z  =  whether  both  flips  had  the  same  result.  Using  this  example,  show  that  marginal 
independence  for  each  pair  of  three  variables  does  not  imply  that  the  variables  are 
mutually  independent. 

9.19 For three categorical variables X, Y, and Z:

a.  When  Y  is  jointly  independent  of  X  and  Z,  show  that  X  and  Y  are  conditionally 
independent,  given  Z. 

b.  Prove  that  mutual  independence  of  X,  Y ,  and  Z  implies  that  X  and  Y  are  both 
marginally  and  conditionally  independent. 

c.  When  X  is  independent  of  Y  and  Y  is  independent  of  Z,  explain  why  X  is  not 
necessarily  independent  of  Z. 

9.20  Suppose  X  and  Y  are  conditionally  independent,  given  Z,  and  X  and  Z  are  marginally 
independent.  Show  that  X  and  Y  are  also  marginally  independent. 

9.21 A 2 x 2 x 2 table satisfies $\pi_{i++} = \pi_{+j+} = \pi_{++k} = 1/2$, all i, j, k. Give an example of
$\{\pi_{ijk}\}$ that satisfies model (a) (X, Y, Z), (b) (XY, Z), (c) (XY, YZ), (d) (XY, XZ, YZ),
and (e) (XYZ), but in each case not a simpler model.

9.22  Suppose  model  (XY,  XZ,  YZ)  holds  in  a  2  x  2  x  2  table,  and  the  common  XY  condi¬ 
tional  log  odds  ratio  at  the  two  levels  of  Z  is  positive.  If  the  XZ  and  YZ  conditional 
log  odds  ratios  are  both  positive  or  both  negative,  show  that  the  XY  marginal  odds 
ratio  is  larger  than  the  XY  conditional  odds  ratio.  Hence,  Simpson’s  paradox  cannot 
occur  for  the  XY  association. 

9.23 Show that the general loglinear model in T dimensions has $2^T$ terms. [Hint: It has
an intercept, $\binom{T}{1}$ single-factor terms, $\binom{T}{2}$ two-factor terms, ....]

9.24 Each of T responses is binary. For indicator variables $\{z_1, \ldots, z_T\}$, the loglinear
model of mutual independence has the form

    $\log \mu_{z_1, \ldots, z_T} = \lambda_1 z_1 + \cdots + \lambda_T z_T$.

Show  how  to  express  the  general  loglinear  model  (Cox  1972). 

9.25  Consider  a  cross-classification  of  W,  X,  Y,  Z. 

a.  Explain  why  (WXZ,  WYZ)  is  the  most  general  loglinear  model  for  which  X  and  Y 
are  conditionally  independent. 

b.  State  the  model  symbol  for  which  X  and  Y  are  conditionally  independent  and 
there  is  no  three-factor  interaction. 

9.26  For  a  four-way  table  with  binary  response  Y,  give  the  equivalent  loglinear  and  logistic 
models  that  have  main  effects  of  factors  A,  B,  and  C  on  Y  when  (a)  Y  is  binary  and 
(b)  Y  has  J  >  2  categories. 

9.27  For  the  independence  model  for  a  two-way  table,  derive  minimal  sufficient  statistics, 
likelihood  equations,  fitted  values,  and  residual  df. 

9.28 For the loglinear model for an I x J table, $\log \mu_{ij} = \lambda + \lambda_i^X$, show that $\hat\mu_{ij} = n_{i+}/J$
and residual df = I(J - 1).

9.29 Write the log likelihood L for model (XZ, YZ). Calculate $\partial L/\partial \lambda$ and show that
it implies $\hat\mu_{+++} = n$. Show that $\partial L/\partial \lambda_i^X = n_{i++} - \mu_{i++}$. Similarly, differentiate
with respect to each parameter to obtain likelihood equations. Show (9.22) and
(9.23) imply the other equations, so those equations determine the ML estimates.


9.30  Consider  the  loglinear  model  with  symbol  (XZ,  YZ). 

a. For fixed k, show that $\{\hat\mu_{ijk}\}$ equal the fitted values for testing independence
between X and Y within level k of Z.

b. Show that the Pearson and likelihood-ratio statistics for testing this model’s fit
have form $X^2 = \sum_k X_k^2$, where $X_k^2$ tests independence between X and Y at level k
of Z.


9.31  Table  9.22  shows  fitted  values  for  models  for  four-way  tables  that  have  direct 
estimates. 

a. Use Birch’s results to verify that the entry is correct for (W, X, Y, Z). Verify its
residual  df. 

b.  Motivate  the  estimate  and  df  formulas  for  (WX,  YZ),  (WXY,  Z),  (WXY,  WZ),  and 
(WXY,  WXZ)  using  composite  variables  and  the  corresponding  results  for  two- 
way  tables  [e.g.,  for  (WXY,  WZ),  given  W,  Z  is  independent  of  the  composite  XY 
variable]. 


Table 9.22 Fits of Four-Way-Table Loglinear Models for Exercise 9.31 (a)

Model           Expected Frequency Estimate                                    Residual DF
(W, X, Y, Z)    $n_{h+++} n_{+i++} n_{++j+} n_{+++k} / n^3$                    HIJK - H - I - J - K + 3
(WX, Y, Z)      $n_{hi++} n_{++j+} n_{+++k} / n^2$                             HIJK - HI - J - K + 2
(WX, WY, Z)     $n_{hi++} n_{h+j+} n_{+++k} / (n_{h+++} n)$                    HIJK - HI - HJ - K + H + 1
(WX, YZ)        $n_{hi++} n_{++jk} / n$                                        (HI - 1)(JK - 1)
(WX, WY, XZ)    $n_{hi++} n_{h+j+} n_{+i+k} / (n_{h+++} n_{+i++})$             HIJK - HI - HJ - IK + H + I
(WX, WY, WZ)    $n_{hi++} n_{h+j+} n_{h++k} / (n_{h+++})^2$                    HIJK - HI - HJ - HK + 2H
(WXY, Z)        $n_{hij+} n_{+++k} / n$                                        (HIJ - 1)(K - 1)
(WXY, WZ)       $n_{hij+} n_{h++k} / n_{h+++}$                                 H(IJ - 1)(K - 1)
(WXY, WXZ)      $n_{hij+} n_{hi+k} / n_{hi++}$                                 HI(J - 1)(K - 1)

(a) Number of levels of W, X, Y, Z, denoted by H, I, J, K. Estimates for other models of each type are obtained by
symmetry.


9.32 A T-dimensional table $\{n_{ab \ldots t}\}$ has $I_i$ categories in dimension i.

a. For the mutual independence model, explain why the minimal sufficient statistics
are the one-way marginal distributions, the fitted probabilities are the product
of the T one-dimensional marginal proportions, and residual df =
$\prod_i I_i - [1 + \sum_i (I_i - 1)] = \prod_i I_i - \sum_i I_i + T - 1$.

b. For the hierarchical homogeneous association model having all two-factor
associations but no three-factor interactions, explain why the minimal sufficient
statistics are all the two-factor marginal distributions, and the residual df =
$\prod_i I_i - [1 + \sum_i (I_i - 1) + \sum\sum_{i<j} (I_i - 1)(I_j - 1)]$.

9.33 Consider loglinear model (X, Y, Z) for a 2 x 2 x 2 table.

a. Express the model in the form $\log \mu = X\beta$.

b. Show that the likelihood equations $X^T n = X^T \hat\mu$ equate $\{n_{ijk}\}$ and $\{\hat\mu_{ijk}\}$ in the
one-dimensional margins.

9.34  Apply  IPF  to  model  (a)  (X,  YZ)  and  (b)  (XZ,  YZ).  Show  that  the  ML  estimates  result 
within  one  cycle. 

9.35 Refer to Section 9.6.3. Show that L has individual terms converging to $-\infty$ as
$\log \mu_i \to \pm\infty$. Explain why positive definiteness of the information matrix implies
that  the  solution  of  the  likelihood  equations  is  unique,  with  likelihood  maximized  at 
that  point. 


CHAPTER  10 


Building  and  Extending 
Loglinear  Models 


From  Chapter  9,  loglinear  models  for  contingency  tables  use  the  log  link  for  Poisson  cell 
counts  in  describing  associations  and  interactions  among  a  set  of  categorical  response 
variables,  and  connections  exist  between  them  and  logistic  models.  In  this  chapter  we 
discuss  topics  dealing  with  building  and  extending  loglinear  models. 

In Section 10.1 we show how certain models having a conditional independence structure
can be represented by graphs. In Section 10.2 we discuss selection and comparison
of loglinear models. Diagnostics for checking models, such as residuals, are presented in
Section 10.3. The loglinear models of Chapter 9 treat all variables as nominal. Generalizations
of loglinear models and related association models and correlation models can
also describe association between ordinal variables. One approach scores those variables
with fixed or parameter scores, as Sections 10.4 and 10.5 show. In Section 10.6 we cover
complications that can occur with sparse contingency tables. In the final section we discuss
Bayesian loglinear modeling.


10.1  CONDITIONAL  INDEPENDENCE  GRAPHS  AND  COLLAPSIBILITY 

Many loglinear models can be portrayed with a graph that represents the conditional
independence structure among the responses. This representation also helps to reveal
implications of models, such as when an association is unchanged when we collapse a table
over another variable.


10.1.1  Conditional  Independence  Graphs 

A graph, according to graph theory, consists of two sets: a set of vertices and a set of edges
connecting some vertices. In a conditional independence graph, each vertex represents
a variable. The absence of an edge connecting two variables represents a conditional
independence between them. For instance, loglinear model (WX, WY, WZ, YZ) lacks XY
and XZ terms. It assumes independence between X and Y and between X and Z, conditional
on the remaining two variables. Figure 10.1 portrays this model’s graph. The four variables
form the vertices. The four edges represent pairwise conditional associations. Edges do not
connect X and Y or connect X and Z, the conditionally independent pairs.

Figure 10.1 Conditional independence graphs for loglinear models (WX, WY, WZ, YZ) and (WX, WYZ).
[Graph: vertices W, X, Y, Z; edges X-W, W-Y, W-Z, Y-Z.]

Two  loglinear  models  with  the  same  pairwise  associations  have  the  same  conditional 
independence  graph.  For  instance,  the  Figure  10.1  graph  is  also  the  one  for  model  (WX, 
WYZ),  which  adds  a  three-factor  WYZ  interaction. 

A  set  of  properties,  called  Markov  properties,  allows  us  to  deduce  from  the  graph 
the  conditional  independence  structure  between  variables  and  groups  of  variables.  One 
such  property,  the  global  Markov  property,  links  statements  of  conditional  independence 
between  two  variables  or  groups  of  variables  to  the  concept  of  separation  in  the  graph.  A 
path  in  a  conditional  independence  graph  is  a  sequence  of  edges  leading  from  one  variable 
to  another.  Two  variables  (or  groups  of  variables)  X  and  Y  are  said  to  be  separated  by  a 
subset  of  variables  if  all  paths  connecting  X  and  Y  intersect  that  subset.  For  instance,  in 
Figure  10.1,  W  separates  X  and  Y,  since  any  path  connecting  X  and  Y  goes  through  W.  The 
subset {W, Z} also separates X and Y. The global Markov property states that two variables
are  conditionally  independent  given  any  subset  of  variables  that  separates  them  (Darroch 
et  al.  1980,  Kreiner  1987).  Thus,  not  only  are  X  and  Y  conditionally  independent  given  W 
and  Z,  but  also  given  W  alone.  Similarly,  X  and  Z  are  conditionally  independent  given  W 
alone.  This  property  is  equivalent  to  a  local  Markov  property  according  to  which  a  variable 
is  conditionally  independent  of  all  other  variables,  given  its  adjacent  neighbors  to  which 
it’s  connected  with  an  edge. 
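
The separation criterion can be checked mechanically. The following Python sketch is only an illustration, not part of the original text: it encodes the Figure 10.1 graph as an edge list and uses a breadth-first search that is not allowed to enter the candidate separating subset. The names `EDGES`, `neighbors`, and `is_separated` are hypothetical helpers introduced here.

```python
from collections import deque

# Edges of the Figure 10.1 graph for model (WX, WY, WZ, YZ).
EDGES = [("W", "X"), ("W", "Y"), ("W", "Z"), ("Y", "Z")]

def neighbors(edges):
    """Build an adjacency map for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def is_separated(edges, x, y, subset):
    """Return True if every path from x to y passes through `subset`."""
    adj = neighbors(edges)
    blocked = set(subset)
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w == y:
                return False  # reached y without entering the subset
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return True

print(is_separated(EDGES, "X", "Y", {"W"}))       # W alone separates X and Y
print(is_separated(EDGES, "X", "Y", {"W", "Z"}))  # so does {W, Z}
print(is_separated(EDGES, "Y", "Z", {"W"}))       # Y and Z share an edge
```

By the global Markov property, each `True` result corresponds to a conditional independence statement implied by the graph, whereas a `False` result only says that the graph does not imply one.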

10.1.2  Graphical  Loglinear  Models 

Darroch  et  al.  (1980)  used  graph  theory  to  represent  hierarchical  loglinear  models  having  a 
conditional  independence  structure.  Those  models,  called  graphical  models,  are  represented 
by  undirected  graphs  such  as  just  shown.  In  the  graph,  a  maximally  connected  subset  is 
called  a  clique.  In  Figure  10.1,  for  example,  the  three  variables  W,  Y,  and  Z  form  a  clique, 
but  any  two  of  those  three  variables  would  not  form  one.  The  second  clique  of  this  graph 
is  XW.  For  a  graphical  model,  the  generating  classes  that  provide  the  sufficient  statistics 
are  the  cliques.  So,  in  Figure  10.1,  the  joint  distribution  for  W,  Y,  and  Z  is  a  sufficient 
statistic, and the graphical model corresponding to that graph is (XW, WYZ) rather than
(XW,  WY,  WZ,  YZ). 
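
To make the link between cliques and the generating class concrete, the sketch below enumerates the maximal cliques of a small graph by brute force and prints them as a model symbol. It is an illustrative sketch under the assumption that the graph has only a handful of variables, since the search visits every subset of vertices; `maximal_cliques` is a hypothetical helper, not a routine from any statistical package.

```python
from itertools import combinations

# Edges of the Figure 10.1 graph.
EDGES = [("W", "X"), ("W", "Y"), ("W", "Z"), ("Y", "Z")]

def maximal_cliques(edges):
    """Enumerate maximal cliques by brute force (small graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edges for v in e})
    # All subsets in which every pair of vertices is joined by an edge.
    complete = [
        set(c)
        for r in range(1, len(vertices) + 1)
        for c in combinations(vertices, r)
        if all(frozenset(p) in edge_set for p in combinations(c, 2))
    ]
    # Keep complete subsets not strictly contained in a larger one.
    return [c for c in complete if not any(c < d for d in complete)]

cliques = maximal_cliques(EDGES)
symbol = "(" + ", ".join("".join(sorted(c)) for c in cliques) + ")"
print(symbol)  # prints (WX, WYZ)
```

Applied to a square graph with edges WX, XY, YZ, and ZW, the same function returns the four edges themselves, so the graphical model for that graph is (WX, XY, YZ, ZW).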

The  family  of  graphical  models  contains  the  family  of  decomposable  models  (Andersen 
1974,  Thm.  5).  Recall  that  for  decomposable  models,  the  joint  distribution  factors  into  a 
product  of  marginal  distributions  and  conditional  distributions,  and  direct  ML  estimates 
exist. Not all graphical models are decomposable, however. An example is the loglinear
model (WX, XY, YZ, ZW), which has the graphical appearance of a square. This model
exhibits  two  conditional  independences  (between  W  and  Y  and  between  X  and  Z)  but 
does  not  have  direct  ML  estimates.  Tables  1  and  2  in  Darroch  et  al.  (1980)  portray  all  the 
decomposable  and  nondecomposable  graphical  models  of  dimension  <5. 

10.1.3  Collapsibility  in  Three-Way  Contingency  Tables 

We  have  seen  that  conditional  associations  in  partial  tables  usually  differ  from  marginal 
associations.  Under  the  collapsibility  conditions  given  in  Section  2.3.6,  however,  they 
are  the  same.  For  the  odds  ratio,  the  collapsibility  conditions  relate  to  logistic  models, 
as  we  observed  in  Section  6.4.8,  and  loglinear  models.  Recall  that  in  a  three-way  table, 
XY  marginal  and  conditional  odds  ratios  are  identical  if  either  Z  and  X  are  conditionally 
independent  or  if  Z  and  Y  are  conditionally  independent  (or  both).  These  conditions  occur 
for  loglinear  models  (XY,  YZ),  (XY,  XZ),  and  (XY,  Z). 

The  proof  of  the  collapsibility  conditions  follows  directly  from  the  model  formulas.  We 
illustrate  with  model  (XY,  XZ).  For  it,  the  XY  marginal  table  has 

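    $\mu_{ij+} = \sum_k \exp(\lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ})$
             $= \exp(\lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}) \sum_k \exp(\lambda_k^Z + \lambda_{ik}^{XZ})$.

The loglinear model for that marginal table satisfies

    $\log \mu_{ij+} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY} + \log[\sum_k \exp(\lambda_k^Z + \lambda_{ik}^{XZ})]$,

where the final term depends only on i and can be combined with the $\lambda_i^X$ term to give the
main effect for X. Note that $\{\lambda_{ij}^{XY}\}$ are the same for the marginal table. Since the XY odds
ratios are functions of $\{\lambda_{ij}^{XY}\}$, they are the same in both marginal and partial tables.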
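
The argument can be checked numerically. The sketch below builds expected frequencies for a 2 x 2 x 2 table under model (XY, XZ) and compares the conditional and marginal XY odds ratios. The parameter values are hypothetical, chosen only for illustration, and the coding sets every term involving level 0 to zero, which is an assumption of this sketch rather than a convention taken from the text.

```python
import math

# Hypothetical parameter values for model (XY, XZ) in a 2 x 2 x 2 table,
# with baseline coding (terms involving level 0 are zero). No YZ term.
lam, lam_x, lam_y, lam_z = 1.0, 0.3, -0.2, 0.5
lam_xy, lam_xz = 0.8, -0.6

def mu(i, j, k):
    """Expected frequency in cell (i, j, k) under model (XY, XZ)."""
    return math.exp(lam + lam_x * i + lam_y * j + lam_z * k
                    + lam_xy * i * j + lam_xz * i * k)

def odds_ratio(t):
    """Odds ratio of a 2 x 2 table stored as a dict keyed by (i, j)."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

cells = [(i, j) for i in (0, 1) for j in (0, 1)]
partial = [{c: mu(*c, k) for c in cells} for k in (0, 1)]
marginal = {c: mu(*c, 0) + mu(*c, 1) for c in cells}

print([odds_ratio(t) for t in partial])  # conditional XY odds ratios
print(odds_ratio(marginal))              # marginal XY odds ratio
print(math.exp(lam_xy))                  # exp of the XY parameter
```

All three printed quantities agree up to rounding error, because the sum over k contributes a factor that depends only on i and therefore cancels in the XY odds ratio.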

We  illustrate  for  the  student  survey  data  (Table  9.3)  from  Section  9.2.4,  about  alcohol, 
cigarette,  and  marijuana  use.  Model  (AM,  CM)  specifies  AC  conditional  independence, 
given  M.  It  has  conditional  independence  graph 

A - M - C. 

Consider  the  AM  association.  Since  C  is  conditionally  independent  of  A,  the  AM  fitted 
conditional  odds  ratios  are  the  same  as  the  AM  fitted  marginal  odds  ratio  collapsed  over 
C.  From  Table  9.5,  both  equal  61.9.  Similarly,  the  CM  association  is  collapsible.  The  AC 
association  is  not,  because  M  is  conditionally  dependent  with  both  A  and  C  in  model  (AM, 
CM);  that  is,  an  edge  connects  M  to  both  A  and  C.  Thus,  A  and  C  may  be  marginally 
dependent,  even  though  they  are  conditionally  independent.  In  fact,  from  Table  9.5,  the 
fitted AC marginal odds ratio for this model is 2.7 rather than 1.0.

For model (AC, AM, CM), no pair is conditionally independent. No collapsibility conditions
are fulfilled. Table 9.5 showed that each pair has quite different fitted marginal
and  conditional  associations  for  this  model.  When  a  model  contains  all  two-factor  effects, 
effects  may  change  after  collapsing  over  any  variable. 

10.1.4 Collapsibility for Multiway Tables

Bishop  et  al.  (1975,  p.  47)  provided  a  parametric  collapsibility  condition  for  multiway 
tables: 

Suppose  that  a  model  for  a  multiway  table  partitions  variables  into  three  mutually  exclusive 
subsets, A, B, C, such that B separates A and C. After collapsing the table over the variables
in  C,  parameters  relating  variables  in  A  and  parameters  relating  variables  in  A  to  variables  in 
B  are  unchanged. 

We illustrate using model (WX, WY, WZ, YZ) (Figure 10.1). Let A = {X}, B = {W}, and
C = {Y, Z}. Since the XY and XZ terms do not appear, all parameters linking set A with set
C equal zero, and B separates A and C. If we collapse over Y and Z, the WX association is
unchanged. Next, identify A = {Y, Z}, B = {W}, C = {X}. Then, conditional associations
among W, Y, and Z remain the same after collapsing over X.

This  result  also  implies  that  when  any  variable  is  independent  of  all  other  variables, 
collapsing  over  it  does  not  affect  any  other  model  terms.  In  model  (WX,  WY,  XY,  Z),  for 
instance, associations among W, X, and Y are the same as in model (WX, WY, XY).

When  the  separating  set  B  contains  more  than  one  variable,  although  parameter  values 
are  unchanged  in  collapsing  over  set  C,  the  ML  estimates  of  those  parameters  may  differ 
slightly.  A  stronger  collapsibility  definition  also  requires  that  the  estimates  be  identical. 
This  condition  of  commutativity  of  fitting  and  collapsing  holds  if  the  model  contains  the 
highest-order term relating variables in B to each other; that is, it is graphical. Asmussen
and Edwards (1983) discussed this property, which relates to decomposability.


10.2  MODEL  SELECTION  AND  COMPARISON 

Strategies  for  selecting  and  comparing  loglinear  models  are  similar  to  those  for  logistic 
regression presented in Section 6.1. A model should be complex enough to fit well but also
relatively  simple  to  interpret,  smoothing  rather  than  overfitting  the  data. 

10.2.1  Considerations  in  Model  Selection 

The  potentially  useful  models  are  usually  a  small  subset  of  the  possible  models.  A  study 
designed  to  answer  certain  questions  through  confirmatory  analyses  may  plan  to  compare 
models  that  differ  only  by  the  inclusion  of  certain  terms.  Also,  models  should  recognize 
distinctions between response and explanatory variables. The modeling process should concentrate
on terms linking responses and terms linking explanatory variables to responses.
The model should contain the most general interaction term relating the explanatory variables.
From the likelihood equations, this has the effect of equating the fitted totals to
the  sample  totals  at  combinations  of  their  levels.  This  is  natural,  since  we  normally  treat 
such  totals  as  fixed.  Related  to  this,  certain  marginal  totals  are  often  fixed  by  the  sampling 
design.  Any  potential  model  should  include  those  totals  as  sufficient  statistics,  so  likelihood 
equations  equate  them  to  the  fitted  totals. 

Consider Table 9.8 with I = automobile injury and S = seat-belt use as response variables.
Since G = gender and L = location are explanatory variables, we treat $\{n_{g+\ell+}\}$ as
fixed at each combination for G and L. For example, 20,629 women had accidents in urban
locations, so the fitted counts should have 20,629 women in urban locations. To ensure this,
a loglinear model should contain the GL term, which implies from its likelihood equations
that $\{\hat\mu_{g+\ell+} = n_{g+\ell+}\}$. Thus, the model should be at least as complex as (GL, S, I) and
focus on the effects of G and L on S and I as well as the SI association.

For  exploratory  studies,  one  approach  first  fits  the  model  having  single-factor  terms, 
then  the  model  having  two-factor  and  single-factor  terms,  then  the  model  having  three- 
factor  and  lower  terms,  and  so  on.  Fitting  such  models  often  reveals  a  restricted  range  of 
good-fitting  models.  In  Section  9.4.2  we  used  this  strategy  with  the  automobile  injury  data 
set.  Automatic  search  mechanisms  among  possible  models,  such  as  backward  elimination, 
may  also  be  useful.  However,  they  should  be  used  with  care  and  skepticism,  as  they  need 
not  yield  a  meaningful  model. 


10.2.2  Example:  Model  Building  for  Student  Survey 

The study on the use of alcohol (A), cigarettes (C), and marijuana (M) by a sample of high
school seniors also classified students by gender (G) and race (R). Table 10.1 shows the
five-dimensional contingency table. In selecting a model, we treat A, C, and M as response
variables and G and R as explanatory. Thus, a model should contain the GR term, which
forces the GR fitted marginal totals to equal the sample marginal totals.

Table 10.2 displays goodness-of-fit tests for several models. Because many cell counts
are small, the chi-squared approximation for $G^2$ may be poor, but this index is useful for
comparing models. The first model listed contains only the GR association and assumes
conditional independence for the other nine pairs of associations. It fits horribly, which
is no surprise. Model 2, with all two-factor terms, seems to fit well. Model 3, containing
all the three-factor interaction terms, also fits well, but the improvement in fit is not great
(difference in $G^2$ of 15.3 - 5.3 = 10.0 based on df = 16 - 6 = 10). Thus, we consider
models without three-factor terms. Beginning with model 2, we eliminate two-factor terms.
We use backward elimination, sequentially taking out the term for which the resulting increase
in $G^2$ is smallest when the model is refitted.

Table  10.2  shows  the  start  of  this  process.  Nine  pairwise  associations  are  candidates 
for  removal  from  model  2  (all  except  GR),  shown  in  models  4a  through  4i.  The  smallest 


Table 10.1  Alcohol, Cigarette, and Marijuana Use, by Gender and Race

                                          Marijuana Use
                           Race = White                  Race = Other
                        Female        Male            Female        Male
Alcohol   Cigarette
Use       Use          Yes    No    Yes    No        Yes    No    Yes    No

Yes       Yes          405   268    453   228         23    23     30    19
          No            13   218     28   201          2    19      1    18
No        Yes            1    17      1    17          0     1      1     8
          No             1   117      1   133          0    12      0    17

Source: Harry Khamis, Wright State University.

Table 10.2  Goodness-of-Fit Tests for Loglinear Models for Table 10.1

Model(a)                                    G^2        X^2     df

1.  Mutual independence + GR             1325.14    1454.14    25
2.  Homogeneous association                15.34      18.68    16
3.  All three-factor terms                  5.27       4.80     6
4a. (2) - AC                              201.20     190.60    17
4b. (2) - AM                              106.96     108.11    17
4c. (2) - CM                              513.47     474.26    17
4d. (2) - AG                               18.72      23.14    17
4e. (2) - AR                               20.32      30.32    17
4f. (2) - CG                               16.32      19.16    17
4g. (2) - CR                               15.78      20.12    17
4h. (2) - GM                               25.16      27.97    17
4i. (2) - MR                               18.93      22.83    17
5.  (AC, AM, CM, AG, AR, GM, GR, MR)       16.74      20.51    18
6.  (AC, AM, CM, AG, AR, GM, GR)           19.91      23.02    19
7.  (AC, AM, CM, AG, AR, GR)               28.81      32.13    20

(a) G, gender; R, race; A, alcohol use; C, cigarette use; M, marijuana use.


increase in $G^2$, compared with model 2, occurs in removing the CR term (i.e., model 4g).
The increase is 15.78 - 15.34 = 0.44, with df = 17 - 16 = 1, so this elimination seems
sensible. After removing it, the smallest additional increase results from removing the CG
term (model 5), resulting in $G^2$ = 16.74 with df = 18. Removing next the MR term (model
6) yields $G^2$ = 19.91 with df = 19.
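
The first elimination step can be retraced from the reported deviances alone. The following is a minimal sketch, not the author's software: it only copies the $G^2$ values from Table 10.2 into a dictionary (the variable names are illustrative) and finds the term whose removal from model 2 increases $G^2$ least.

```python
# G^2 values copied from Table 10.2; models 4a-4i each drop one term from model 2.
G2_MODEL_2 = 15.34
g2_without = {
    "AC": 201.20, "AM": 106.96, "CM": 513.47,
    "AG": 18.72, "AR": 20.32, "CG": 16.32,
    "CR": 15.78, "GM": 25.16, "MR": 18.93,
}

# Increase in G^2 (on df = 17 - 16 = 1) caused by dropping each candidate term.
increase = {term: g2 - G2_MODEL_2 for term, g2 in g2_without.items()}
first_removed = min(increase, key=increase.get)

# Deviances along the elimination path described in the text.
path = [("2", 15.34), ("4g", 15.78), ("5", 16.74), ("6", 19.91), ("7", 28.81)]
steps = [(later[0], later[1] - earlier[1]) for earlier, later in zip(path, path[1:])]

for term, inc in sorted(increase.items(), key=lambda kv: kv[1]):
    print(f"drop {term}: G^2 increases by {inc:.2f}")
print("first term removed:", first_removed)
for model, inc in steps:
    print(f"moving to model {model}: G^2 increases by {inc:.2f}")
```

The later steps need refitted models (the increase from dropping CG after CR is not in models 4a to 4i), so the sketch reads those from the listed deviances for models 5 to 7 rather than recomputing them.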

Further removals have a more severe effect. For instance, removing the AG term increases
$G^2$ by 5.26, with df = 1. Ordinary P-values do not apply for such statistics, since the data
suggested these tests, but it seems safest not to drop additional terms. [See Westfall and
Wolfinger (1997) and Westfall and Young (1993) for methods of adjusting P-values to
account for multiple tests.] Model 6, denoted by (AC, AM, CM, AG, AR, GM, GR), has a
conditional independence graph whose edges are the seven pairs A-C, A-M, C-M, A-G, A-R,
G-M, and G-R.

Every path between C and {G, R} involves a variable in {A, M}. Given the outcome on
alcohol use and marijuana use, the model states that cigarette use is independent of both
gender and race. Collapsing over the explanatory variables race and gender, the conditional
associations between C and A and between C and M are the same as with the model
(AC, AM, CM) fitted in Section 9.2.4.

Removing the GM term from this model yields model 7 in Table 10.2. Its graph reveals
that A separates {G, R} from {C, M}. Thus, all pairwise conditional associations among A,
C, and M in model 7 are identical to those in model (AC, AM, CM), collapsing over G and R.
In fact, model 7 does not fit all that badly ($G^2$ = 28.81 with df = 20) considering the large
sample size. So, we could collapse over gender and race in studying associations among the
primary variables. An advantage of the full five-variable model is that it estimates effects of
gender and race on these responses, in particular, the effects of race and gender on alcohol
use and the effect of gender on marijuana use.

10.2.3  Loglinear Model Comparison Statistics

Consider two loglinear models, $M_1$ and $M_0$, with $M_0$ a special case of $M_1$. In comparing
pairs of models above, we used the likelihood-ratio statistic for testing $M_0$ against $M_1$,
$G^2(M_0 \mid M_1) = G^2(M_0) - G^2(M_1)$.

Let $\mathbf{n}$ denote a column vector of the observed cell counts $\{n_i\}$. Let $\hat{\boldsymbol{\mu}}_0$ and
$\hat{\boldsymbol{\mu}}_1$ denote vectors of the fitted values $\{\hat{\mu}_{0i}\}$ and $\{\hat{\mu}_{1i}\}$ for $M_0$ and $M_1$.
The deviance $G^2(M_0)$ for the simpler model partitions into

$$G^2(M_0) = G^2(M_1) + G^2(M_0 \mid M_1). \tag{10.1}$$

Just as $G^2(M)$ measures the distance of fitted values for $M$ from $\mathbf{n}$, $G^2(M_0 \mid M_1)$ measures
the distance of fit $\hat{\boldsymbol{\mu}}_0$ from fit $\hat{\boldsymbol{\mu}}_1$. In this sense, decomposition (10.1) expresses a certain
orthogonality: The distance of $\mathbf{n}$ from $\hat{\boldsymbol{\mu}}_0$ equals the distance of $\mathbf{n}$ from $\hat{\boldsymbol{\mu}}_1$ plus the distance
of $\hat{\boldsymbol{\mu}}_1$ from $\hat{\boldsymbol{\mu}}_0$.

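As noted in Section 4.5.4, the model comparison statistic simplifies to

$$G^2(M_0 \mid M_1) = 2 \sum_i n_i \log(\hat{\mu}_{1i}/\hat{\mu}_{0i}). \tag{10.2}$$

The two loglinear models have the matrix form (9.19), namely,

$$\log \boldsymbol{\mu}_0 = \mathbf{X}_0 \boldsymbol{\beta}_0 \quad \text{and} \quad \log \boldsymbol{\mu}_1 = \mathbf{X}_1 \boldsymbol{\beta}_1.$$

Since $M_0$ is simpler than $M_1$, we can express $\log \hat{\boldsymbol{\mu}}_0 = \mathbf{X}_0 \hat{\boldsymbol{\beta}}_0 = \mathbf{X}_1 \hat{\boldsymbol{\beta}}_0^*$, where $\hat{\boldsymbol{\beta}}_0^*$ equals
$\hat{\boldsymbol{\beta}}_0$ with 0 elements appended corresponding to the extra parameters in $\boldsymbol{\beta}_1$ that are not in $\boldsymbol{\beta}_0$.
Then, from (10.2),

$$\begin{aligned}
G^2(M_0 \mid M_1) &= 2\mathbf{n}^T(\log \hat{\boldsymbol{\mu}}_1 - \log \hat{\boldsymbol{\mu}}_0) = 2\mathbf{n}^T(\mathbf{X}_1 \hat{\boldsymbol{\beta}}_1 - \mathbf{X}_1 \hat{\boldsymbol{\beta}}_0^*) \\
&= 2\hat{\boldsymbol{\mu}}_1^T(\mathbf{X}_1 \hat{\boldsymbol{\beta}}_1 - \mathbf{X}_1 \hat{\boldsymbol{\beta}}_0^*) = 2\hat{\boldsymbol{\mu}}_1^T(\log \hat{\boldsymbol{\mu}}_1 - \log \hat{\boldsymbol{\mu}}_0) \\
&= 2 \sum_i \hat{\mu}_{1i} \log(\hat{\mu}_{1i}/\hat{\mu}_{0i}),
\end{aligned} \tag{10.3}$$

where the replacement of $\mathbf{n}$ by $\hat{\boldsymbol{\mu}}_1$ follows from the likelihood equations $\mathbf{n}^T \mathbf{X}_1 = \hat{\boldsymbol{\mu}}_1^T \mathbf{X}_1$
for $M_1$ [recall (9.21)]. Statistic (10.3) has the same form as $G^2(M_0)$, but with $\{\hat{\mu}_{1i}\}$ playing
the role of the observed data. Simon (1973) showed a general result of this type for natural
exponential family distributions. From Section 4.5.5, the Pearson statistic $X^2(M_0 \mid M_1)$ for
comparing loglinear models is $\sum_i (\hat{\mu}_{1i} - \hat{\mu}_{0i})^2/\hat{\mu}_{0i}$, which has the usual Pearson form with
$\{\hat{\mu}_{1i}\}$ in place of $\{n_i\}$.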

When $M_0$ holds, $G^2(M_0)$ and $G^2(M_1)$ have large-sample chi-squared distributions, and
$G^2(M_0 \mid M_1)$ is asymptotically chi-squared with df equal to the difference between df for
$M_0$ and $M_1$. Haberman (1977a) showed that $G^2(M_0 \mid M_1)$ and $X^2(M_0 \mid M_1)$ have the same
null large-sample behavior, even for fairly sparse tables. (Under certain conditions, their
difference converges in probability to 0 as n increases.) When $M_1$ holds but $M_0$ does not,
$G^2(M_1)$ still has its asymptotic chi-squared distribution, but the other two statistics tend to
grow unboundedly as n increases.

10.2.4  Partitioning Chi-Squared with Model Comparisons

Equation  (10.1)  utilizes  the  property  by  which  a  chi-squared  statistic  with  df  >  1  partitions 
into  components.  We  used  such  partitionings  in  tests  for  trend  with  ordinal  variables,  such 
as  in  linear  logit  and  linear  probability  models  (Section  5.3.5).  More  generally,  this  property 
applies  with  a  set  of  nested  models  to  test  a  sequence  of  hypotheses.  The  separate  tests  for 
comparing  pairs  of  models  are  asymptotically  independent. 

For example, a chi-squared decomposition with J - 1 models justifies the partitioning
of $G^2$ stated in Section 3.3.3 for testing independence in 2 x J tables. For j = 2, ..., J,
let $M_j$ denote the model that satisfies

$$\theta_i = (\mu_{1i}\,\mu_{2,i+1})/(\mu_{1,i+1}\,\mu_{2i}) = 1, \quad i = 1, \ldots, j - 1.$$

For $M_j$, the 2 x j table consisting of columns 1 through j satisfies independence. Model $M_J$
is independence in the complete 2 x J table. Model $M_h$ is a special case of $M_j$ whenever
h > j. By (10.1),

$$\begin{aligned}
G^2(M_J) &= G^2(M_J \mid M_{J-1}) + G^2(M_{J-1}) \\
&= G^2(M_J \mid M_{J-1}) + G^2(M_{J-1} \mid M_{J-2}) + G^2(M_{J-2}) \\
&= \cdots = G^2(M_J \mid M_{J-1}) + \cdots + G^2(M_3 \mid M_2) + G^2(M_2).
\end{aligned}$$

From (10.3), $G^2(M_j \mid M_{j-1})$ has the $G^2$ form with the fitted values for model $M_{j-1}$ playing
the role of the observed data. Substitution of fitted values for the two models into (10.3)
shows that $G^2(M_j \mid M_{j-1})$ is identical to $G^2$ for testing independence in a 2 x 2 table; the
first column combines columns 1 through j - 1 of the original table, and the second column
is column j of the original table.
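
The partitioning can be checked numerically. The sketch below is illustrative only: the 2 x 4 counts are hypothetical, and the helper names `g2_independence` and `partition_components` are not from the text. It computes $G^2$ for independence in the full table and compares it with the sum of the J - 1 component statistics, each from a 2 x 2 table whose first column pools columns 1 through j - 1.

```python
import math

def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij > 0:  # cells with zero counts contribute nothing
                g2 += 2 * nij * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return g2

def partition_components(table):
    """G^2 for each 2 x 2 table: columns 1..j-1 combined versus column j."""
    n_cols = len(table[0])
    components = []
    for j in range(1, n_cols):
        sub_table = [[sum(row[:j]), row[j]] for row in table]
        components.append(g2_independence(sub_table))
    return components

# Hypothetical 2 x 4 table, used only to illustrate the identity.
table = [[20, 15, 9, 6],
         [10, 14, 16, 30]]

total_g2 = g2_independence(table)
components = partition_components(table)
print("G^2 for the 2 x J table:", round(total_g2, 4))
print("components (df = 1 each):", [round(c, 4) for c in components])
print("sum of components:", round(sum(components), 4))
```

The components sum exactly to the overall statistic (up to floating-point error), which is the algebraic content of the chain of equalities above.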

With several preplanned comparisons, simultaneous test procedures lessen the probability
of attributing importance to sample effects that merely reflect chance variation.
These procedures use adjusted significance levels. For a set of s independent tests for
nested models, when each test has approximate size $1 - (1 - \alpha)^{1/s}$, the overall asymptotic
P(type I error) $\le \alpha$. For instance, suppose that we test the fit of (WXZ, WY, XY, ZY),
compare that model to (WX, WZ, XZ, WY, XY, ZY), and compare that model to (WX, WZ, XZ,
WY, ZY). To have overall $\alpha$ = 0.05 for the s = 3 tests, use level $1 - (0.95)^{1/3} = 0.01695$
for each.
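
As a quick arithmetic check of the adjusted level, the following sketch evaluates $1 - (1 - \alpha)^{1/s}$ and confirms that s independent tests at that level have overall error probability $\alpha$. The function name `per_test_level` is illustrative.

```python
def per_test_level(alpha, s):
    """Size for each of s independent tests so that the overall size is alpha."""
    return 1 - (1 - alpha) ** (1 / s)

level = per_test_level(0.05, 3)
# Probability of at least one type I error across the 3 independent tests.
overall = 1 - (1 - level) ** 3

print(f"per-test level: {level:.5f}")
print(f"overall level:  {overall:.5f}")
```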


10.2.5  Identical  Marginal  and  Conditional  Tests  of  Independence 

A test using $G^2(M_0 \mid M_1)$ simplifies dramatically when both models have direct estimates. In
that case, the models have independence linkages necessary to ensure collapsibility. A test
of conditional independence then has the same result as the test of independence applied
to the marginal table. Sundberg (1975) proved the following: When two direct models $M_0$
and $M_1$ are identical except for a pairwise association term, $G^2(M_0 \mid M_1)$ is identical to $G^2$
for testing independence in the marginal table for that pair of variables. Bishop (1971) and
Goodman (1970, 1971b) have related discussion.

For instance, $G^2[(X, Y, Z) \mid (XY, Z)]$ tests $\lambda^{XY} = 0$ in model (XY, Z). Thus, it tests XY
conditional independence under the assumption that X and Y are jointly independent of Z.
Using the two sets of fitted values, from (10.3), it equals

$$2 \sum_i \sum_j \sum_k \frac{n_{ij+}\,n_{++k}}{n} \log \frac{n_{ij+}\,n_{++k}/n}{n_{i++}\,n_{+j+}\,n_{++k}/n^2}
= 2 \sum_i \sum_j n_{ij+} \log \frac{n_{ij+}}{n_{i++}\,n_{+j+}/n},$$

which is $G^2[(X, Y)]$ for testing independence in the marginal XY table. This is not surprising.
The collapsibility conditions imply that for model (XY, Z), the marginal XY association is
the same as the conditional XY association.
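
The identity can be verified directly because both models have closed-form fitted values. The sketch below uses a hypothetical 2 x 2 x 2 table (the counts and variable names are assumptions for illustration) and computes the comparison statistic three ways: as a difference of deviances, in the form (10.3), and as $G^2$ for independence in the marginal XY table.

```python
import math

# Hypothetical 2 x 2 x 2 counts n[i][j][k] for variables X, Y, Z.
n = [[[25, 15], [10, 20]],
     [[12, 18], [30, 20]]]
I, J, K = 2, 2, 2
cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
total = sum(n[i][j][k] for i, j, k in cells)

# Marginal totals used by the direct (closed-form) fitted values.
n_xy = [[sum(n[i][j]) for j in range(J)] for i in range(I)]
n_x = [sum(n_xy[i]) for i in range(I)]
n_y = [sum(n_xy[i][j] for i in range(I)) for j in range(J)]
n_z = [sum(n[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]

obs = [n[i][j][k] for i, j, k in cells]
fit0 = [n_x[i] * n_y[j] * n_z[k] / total ** 2 for i, j, k in cells]  # model (X, Y, Z)
fit1 = [n_xy[i][j] * n_z[k] / total for i, j, k in cells]            # model (XY, Z)

def g2(observed, fitted):
    """2 * sum of observed * log(observed / fitted), skipping zero counts."""
    return 2 * sum(o * math.log(o / f) for o, f in zip(observed, fitted) if o > 0)

difference = g2(obs, fit0) - g2(obs, fit1)   # G^2(M0) - G^2(M1)
via_10_3 = g2(fit1, fit0)                    # form (10.3): M1 fit plays the data
marginal = 2 * sum(
    n_xy[i][j] * math.log(n_xy[i][j] * total / (n_x[i] * n_y[j]))
    for i in range(I) for j in range(J)
)

print(round(difference, 6), round(via_10_3, 6), round(marginal, 6))
```

All three printed values agree, so for this pair of direct models the conditional test and the marginal test coincide.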


10.3  RESIDUALS  FOR  DETECTING  CELL-SPECIFIC  LACK  OF  FIT 

The model comparison test using $G^2(M_0 \mid M_1)$ is useful for detecting whether an extra term
improves a model fit. Cell residuals provide a cell-specific indication of model lack of fit.


10.3.1  Residuals  for  Loglinear  Models 

In Section 4.5.6 we presented residuals that apply to any Poisson GLM. For cell i in a
contingency table with observed count $n_i$ and fitted value $\hat{\mu}_i$, the Pearson residual is

$$e_i = \frac{n_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}.$$

These relate to the Pearson statistic by $\sum_i e_i^2 = X^2$.

The corresponding standardized residual (Haberman 1973a) is

$$r_i = \frac{n_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i (1 - \hat{h}_i)}}, \tag{10.4}$$

where the leverage $\hat{h}_i$ is a diagonal element of the estimated hat matrix. This has an
asymptotic standard normal distribution and is preferable to the Pearson residual. Alternative
residuals use components of the deviance.


10.3.2  Example:  Student  Survey  Revisited 

For Table 10.1 cross-classifying alcohol, cigarette, and marijuana use by gender and race,
we suggested in Section 10.2.2 that the model with all two-factor associations is plausible.
For it, the only large standardized residual equals 3.2, resulting from a fitted value of 3.1
in the cell having a count of 8. Further comparisons suggested that the simpler model
(AC, AM, CM, AG, AR, GM, GR) is adequate. Its only large standardized residual equals
3.3, from the fitted value of 2.9 in that cell.

From  these  standardized  residuals,  the  number  of  nonwhite  males  who  did  not  use 
alcohol  or  marijuana  but  who  smoked  cigarettes  is  somewhat  greater  than  either  model 
predicts.  The  residuals  do  not  suggest  problems  with  either  model,  considering  the  large 
sample  size  and  the  many  cells  studied. 

10.3.3  Identical Loglinear and Logistic Standardized Residuals

In Section 9.5 we showed that logistic models for contingency tables are equivalent to
certain loglinear models. However, a Pearson residual for a logistic model differs from a
Pearson residual for a loglinear model. The numerators comparing the ith observed and
fitted binomial or Poisson count are the same, since the model fitted values are the same.
However, the logistic model uses a fitted binomial standard deviation in the denominator
[see (6.1)], whereas the loglinear model uses a fitted Poisson standard deviation [see (10.4)].
The binomial standard deviation is the smaller of the two, so the logistic Pearson residual
exceeds the loglinear Pearson residual in absolute value.

Once  divided  by  estimated  standard  errors,  the  standardized  residuals  are  identical  for 
the  two  models.  This  is  another  reason  for  preferring  standardized  residuals  over  ordinary 
Pearson  residuals. 


10.4  MODELING  ORDINAL  ASSOCIATIONS 

The  loglinear  models  presented  so  far  have  a  serious  limitation — they  treat  all  classifications 
as  nominal.  If  the  order  of  a  variable’s  categories  changes  in  any  way,  the  fit  is  the  same. 
For  ordinal  classifications,  these  models  ignore  important  information. 

Refer to Table 10.3. Subjects were asked their opinion about a man and woman having
sexual relations before marriage and also asked whether methods of birth control should
be available to teenagers between the ages of 14 and 16. For the loglinear model of
independence, denoted by I, $G^2(I)$ = 127.65 with df = 9. The model fits poorly. Yet,
adding the ordinary association term makes it saturated and unhelpful.

Table  10.3  also  contains  fitted  values  and  standardized  residuals  for  independence. 
The  residuals  in  the  corners  stand  out.  Sample  counts  are  much  larger  than  independence 


Table 10.3  Opinions About Premarital Sex and Availability of Teenage Birth Control

                                            Teenage Birth Control
Premarital Sex           Strongly Disagree   Disagree          Agree              Strongly Agree

Always wrong             81                  68                60                 38
                         (42.4)a (80.9)b     (51.2) (67.6)     (86.4) (69.4)      (67.0) (29.1)
                         7.6c                3.1               -4.1               -4.8

Almost always wrong      24                  26                29                 14
                         (16.0) (20.8)       (19.3) (23.1)     (32.5) (31.5)      (25.2) (17.6)
                         2.3                 1.8               -0.8               -2.8

Wrong only sometimes     18                  41                74                 42
                         (30.1) (24.4)       (36.3) (36.1)     (61.2) (65.7)      (47.4) (48.8)
                         -2.7                1.0               2.2                -1.0

Not wrong at all         36                  57                161                157
                         (70.6) (33.0)       (85.2) (65.1)     (143.8) (157.4)    (111.4) (155.5)
                         -6.1                -4.6              2.4                6.8

a Independence model fit.
b Linear-by-linear association model fit.
c Standardized residuals for the independence model fit.

Source: 1991 General Social Survey, National Opinion Research Center.

predicts where both responses are the most negative possible or the most positive possible.
By  contrast,  the  counts  are  much  smaller  than  fitted  values  where  one  response  is  the  most 
positive  and  the  other  is  the  most  negative.  Cross-classifications  of  ordinal  variables  often 
exhibit  their  greatest  deviations  from  independence  in  the  corner  cells.  This  pattern  for 
Table  10.3  indicates  lack  of  fit  in  the  form  of  a  positive  trend.  People  who  are  more  willing 
to  make  birth  control  available  to  teenagers  also  tend  to  feel  more  tolerant  about  premarital 
sex. 
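
The independence fits and standardized residuals reported in Table 10.3 can be reproduced from the counts. This is a minimal sketch, assuming the standard closed form for the independence model in a two-way table, in which the standardized residual divides $n_{ij} - \hat{\mu}_{ij}$ by $[\hat{\mu}_{ij}(1 - p_{i+})(1 - p_{+j})]^{1/2}$; for other loglinear models the leverages generally require the hat matrix.

```python
import math

# Observed counts from Table 10.3 (rows: premarital sex; columns: birth control).
counts = [
    [81, 68, 60, 38],
    [24, 26, 29, 14],
    [18, 41, 74, 42],
    [36, 57, 161, 157],
]
n = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# Fitted values under independence: (row total)(column total)/n.
fitted = [[r * c / n for c in col_tot] for r in row_tot]

def pearson_residual(obs, fit):
    return (obs - fit) / math.sqrt(fit)

def standardized_residual(obs, fit, row_prop, col_prop):
    # Closed form that applies to the independence model only.
    return (obs - fit) / math.sqrt(fit * (1 - row_prop) * (1 - col_prop))

pearson = [[pearson_residual(counts[i][j], fitted[i][j]) for j in range(4)]
           for i in range(4)]
std_resid = [[standardized_residual(counts[i][j], fitted[i][j],
                                    row_tot[i] / n, col_tot[j] / n)
              for j in range(4)] for i in range(4)]

x2 = sum(e ** 2 for row in pearson for e in row)
g2 = 2 * sum(counts[i][j] * math.log(counts[i][j] / fitted[i][j])
             for i in range(4) for j in range(4))

for row in std_resid:
    print([round(r, 1) for r in row])
print("G^2(I) =", round(g2, 2), " X^2(I) =", round(x2, 2))
```

The printed matrix shows the pattern described above: large positive residuals in the two corners where the responses agree, large negative residuals in the other two corners.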

Models  for  ordinal  variables  use  association  terms  that  permit  trends.  The  models  are 
more  complex  than  the  independence  model,  yet  unsaturated.  They  are  called  association 
models,  because  they  focus  on  the  association  structure.  Tests  with  association  models  also 
have  improved  power  for  detecting  trends. 

10.4.1  Linear-by-Linear  Association  Model  for  Two-Way  Tables 

For two-way contingency tables, a simple model for two ordinal variables assigns ordered
row scores $u_1 < u_2 < \cdots < u_I$ and column scores $v_1 < v_2 < \cdots < v_J$. The model is

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \beta u_i v_j, \tag{10.5}$$

with constraints such as $\lambda_I^X = \lambda_J^Y = 0$. This is the special case of the saturated model (9.2)
in which $\lambda_{ij}^{XY} = \beta u_i v_j$. It uses only one parameter to describe association, whereas the
saturated model uses (I - 1)(J - 1) parameters.

Independence occurs when $\beta = 0$. The term $\beta u_i v_j$ represents the deviation of $\log \mu_{ij}$
from independence. The deviation is linear in the Y scores at a fixed level of X and linear in
the X scores at a fixed level of Y. In column j, for instance, the deviation is a linear function
of X, having form (slope) x (score for X), with slope $\beta v_j$. Because of this property, (10.5)
is called the linear-by-linear association model (abbreviated, L x L). The model has its
greatest departures from independence in the corners of the table.

The direction and strength of the association depend on $\beta$. When $\beta > 0$, Y tends to
increase as X increases. Expected frequencies are larger than they would be under independence
in cells where X and Y are both high or both low. When $\beta < 0$, Y tends to decrease as X
increases. When the data display a positive or negative trend, the L x L model usually fits
much better than the independence model.

For the 2 x 2 table using the cells intersecting rows a and c with columns b and d, direct
substitution shows that the model has

$$\log \frac{\mu_{ab}\,\mu_{cd}}{\mu_{ad}\,\mu_{cb}} = \beta(u_c - u_a)(v_d - v_b). \tag{10.6}$$

This log odds ratio is stronger as $|\beta|$ increases and for pairs of categories that are farther
apart. Simple interpretations result when $u_2 - u_1 = \cdots = u_I - u_{I-1}$ and $v_2 - v_1 =
\cdots = v_J - v_{J-1}$. When $\{u_i = i\}$ and $\{v_j = j\}$, for instance, the local odds ratios (2.10) for
adjacent rows and adjacent columns have common value $e^{\beta}$. Goodman (1979a) called
this case uniform association. Figure 10.2 portrays local odds ratios having uniform
value.

Figure 10.2  Constant odds ratio implied by uniform association model. (Note: $\beta$ = the constant log odds ratio
for adjacent rows and adjacent columns.)

The choice of scores affects the interpretation of $\beta$. Often, the response scale discretizes
an inherently continuous scale. It is sensible to choose scores that approximate distances
between midpoints of categories for the underlying scale. It is sometimes useful to standardize
the scores, subtracting the mean and dividing by the standard deviation, so

$$\sum_i u_i \pi_{i+} = \sum_j v_j \pi_{+j} = 0, \qquad \sum_i u_i^2 \pi_{i+} = \sum_j v_j^2 \pi_{+j} = 1.$$

The L x L model tends to fit well when an underlying continuous distribution is approximately
bivariate normal. For standardized scores, $\beta$ is then comparable to $\rho/(1 - \rho^2)$,
where $\rho$ is the underlying correlation (Goodman 1981a,b, 1985). For weak associations,
$\beta \approx \rho$.


10.4.2  Corresponding  Logistic  Model  for  Adjacent  Responses 

A logistic formulation of the L x L model treats Y as a response and X as explanatory. Let
$\pi_{j|i} = P(Y = j \mid X = i)$. Using logits for adjacent response categories (Section 8.3.4),

$$\log \frac{\pi_{j+1|i}}{\pi_{j|i}} = \log \frac{\mu_{i,j+1}}{\mu_{ij}} = (\lambda_{j+1}^Y - \lambda_j^Y) + \beta(v_{j+1} - v_j)u_i.$$

For unit-spaced $\{v_j\}$, this simplifies to

$$\log \frac{\pi_{j+1|i}}{\pi_{j|i}} = \alpha_j + \beta u_i,$$

where $\alpha_j = \lambda_{j+1}^Y - \lambda_j^Y$. The same linear logit effect $\beta$ applies simultaneously for all (J - 1)
pairs of adjacent response categories: The odds that Y = j + 1 instead of Y = j multiply
by $e^{\beta}$ for each unit change in X. In using equal-interval response scores, we implicitly
assume that the effect of X is the same on each of the J - 1 adjacent-categories logits for Y.
10.4.3  Likelihood  Equations  and  Model  Fitting 

The Poisson log likelihood $L(\boldsymbol{\mu}) = \sum_i \sum_j n_{ij} \log \mu_{ij} - \sum_i \sum_j \mu_{ij}$ for a two-way table
simplifies for the L x L model (10.5) to

$$L(\boldsymbol{\mu}) = n\lambda + \sum_i n_{i+}\lambda_i^X + \sum_j n_{+j}\lambda_j^Y + \beta \sum_i \sum_j u_i v_j n_{ij}
- \sum_i \sum_j \exp(\lambda + \lambda_i^X + \lambda_j^Y + \beta u_i v_j).$$

Differentiating $L(\boldsymbol{\mu})$ with respect to $(\lambda_i^X, \lambda_j^Y, \beta)$ and setting the three partial derivatives
equal to zero yields likelihood equations

$$\hat{\mu}_{i+} = n_{i+}, \quad i = 1, \ldots, I, \qquad \hat{\mu}_{+j} = n_{+j}, \quad j = 1, \ldots, J,$$

$$\sum_i \sum_j u_i v_j \hat{\mu}_{ij} = \sum_i \sum_j u_i v_j n_{ij}.$$

Iterative methods such as Newton-Raphson yield the ML fit.

Let $p_{ij} = n_{ij}/n$ and $\hat{\pi}_{ij} = \hat{\mu}_{ij}/n$. The third likelihood equation implies that

$$\sum_i \sum_j u_i v_j \hat{\pi}_{ij} = \sum_i \sum_j u_i v_j p_{ij}.$$

Since marginal distributions and hence marginal means and variances are identical for
fitted and observed distributions, the third equation implies that the correlation between the
scores for X and Y is the same for both distributions. The fitted counts display the same
positive or negative trend as the data.

Since $\{u_i\}$ and $\{v_j\}$ are fixed, the L x L model (10.5) has only one more parameter ($\beta$)
than the independence model. Its residual df = IJ - I - J. It is unsaturated for all but
2 x 2 tables.
10.4.4  Example:  Sex  and  Birth  Control  Opinions  Revisited 

Table 10.3 also reports fitted values for the linear-by-linear association model, using scores
(1, 2, 3, 4) for rows and columns. Table 10.4 shows software output. To get this, we added to
the independence model a variable (denoted by "linlin") having values equal to the product
of row and column numbers. Compared with the independence model, for which $G^2(I)$ =
127.65 with df = 9, the L x L model fits dramatically better [$G^2(L \times L)$ = 11.53, df = 8].
This is especially noticeable in the corners, where it predicts the greatest departures from
independence.
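
The fit can be reproduced without statistical software. The sketch below is one possible implementation, not the procedure that produced Table 10.4: it codes the model as a Poisson GLM with an intercept, row and column indicators (last level as baseline), and the "linlin" product of row and column numbers, then runs iteratively reweighted least squares, which for the log link coincides with Newton-Raphson. The helper names `design_row` and `solve` are illustrative.

```python
import math

# Observed counts from Table 10.3 (rows: premarital sex; columns: birth control).
counts = [
    [81, 68, 60, 38],
    [24, 26, 29, 14],
    [18, 41, 74, 42],
    [36, 57, 161, 157],
]
I, J = 4, 4

def design_row(i, j):
    """Intercept, row dummies, column dummies (last level baseline), u_i * v_j."""
    rows = [1.0 if i == a else 0.0 for a in range(I - 1)]
    cols = [1.0 if j == b else 0.0 for b in range(J - 1)]
    return [1.0] + rows + cols + [float((i + 1) * (j + 1))]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [A[r][:] + [b[r]] for r in range(size)]
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, size):
            factor = M[r][c] / M[c][c]
            for k in range(c, size + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        rest = sum(M[r][k] * x[k] for k in range(r + 1, size))
        x[r] = (M[r][size] - rest) / M[r][r]
    return x

X = [design_row(i, j) for i in range(I) for j in range(J)]
y = [float(counts[i][j]) for i in range(I) for j in range(J)]
p = len(X[0])

eta = [math.log(v + 0.5) for v in y]      # starting values for log(mu)
for _ in range(25):                        # IRLS / Newton-Raphson iterations
    mu = [math.exp(e) for e in eta]
    z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]   # working response
    xtwx = [[sum(m * r[a] * r[b] for r, m in zip(X, mu)) for b in range(p)]
            for a in range(p)]
    xtwz = [sum(m * r[a] * zi for r, m, zi in zip(X, mu, z)) for a in range(p)]
    coef = solve(xtwx, xtwz)
    eta = [sum(c * x for c, x in zip(coef, r)) for r in X]

mu = [math.exp(e) for e in eta]
beta_hat = coef[-1]
deviance = 2 * sum(yi * math.log(yi / m) - (yi - m) for yi, m in zip(y, mu))

print("beta-hat for linlin:", round(beta_hat, 4))
print("deviance G^2(L x L):", round(deviance, 4))
print("fitted corner cells:", round(mu[0], 1), round(mu[3], 1),
      round(mu[12], 1), round(mu[15], 1))
```

The estimate of $\beta$ does not depend on which level of each factor is taken as baseline; only the intercept and main-effect estimates change with that coding.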

The ML estimate $\hat{\beta}$ = 0.286 (SE = 0.028) indicates that subjects having more favorable
attitudes about teen birth control also tend to have more tolerant attitudes about
premarital sex. The estimated local odds ratio is $\exp(\hat{\beta})$ = exp(0.286) = 1.33. The 95%

Table 10.4  Linear-by-Linear Association Model Output (SAS) for Table 10.3

          Criteria For Assessing Goodness Of Fit
          Criterion              DF       Value
          Deviance                8     11.5337
          Pearson Chi-Square      8     11.5085

                            Standard       Wald 95%            Chi-
Parameter       Estimate      Error      Conf. Limits        Square    Pr > ChiSq
Intercept         0.4735     0.4339    -0.3769    1.3239       1.19       0.2751
premar    1       1.7537     0.2343     1.2944    2.2129      56.01       <.0001
premar    2       0.1077     0.1988    -0.2820    0.4974       0.29       0.5880
premar    3      -0.0163     0.1264    -0.2641    0.2314       0.02       0.8972
premar    4       0.0000     0.0000     0.0000    0.0000
birth     1       1.8797     0.2491     1.3914    2.3679      56.94       <.0001
birth     2       1.4156     0.1996     1.0243    1.8068      50.29       <.0001
birth     3       1.1551     0.1291     0.9021    1.4082      80.07       <.0001
birth     4       0.0000     0.0000     0.0000    0.0000
linlin            0.2858     0.0282     0.2305    0.3412     102.46       <.0001

                    LR Statistics
          Source     DF    Chi-Square    Pr > ChiSq
          linlin      1        116.12        <.0001



Wald confidence interval is exp[0.286 ± 1.96(0.028)], or (1.26, 1.41). The strength of
association seems weak. From (10.6), however, nonlocal odds ratios are stronger. The
estimated odds ratio for the four corner cells, obtained using $\hat{\beta}$ or the corner fitted values,
equals

$$\exp[\hat{\beta}(u_4 - u_1)(v_4 - v_1)] = \exp[0.286(4 - 1)(4 - 1)] = \frac{(80.9)(155.5)}{(29.1)(33.0)} = 13.1.$$
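
These odds ratios follow from (10.6) and the reported estimate. The sketch below plugs in $\hat{\beta}$ = 0.286 and SE = 0.028 as quoted in the text; the function `odds_ratio` is an illustrative helper that evaluates (10.6) for any pair of rows and pair of columns.

```python
import math

beta_hat, se = 0.286, 0.028          # estimate and SE reported for the L x L model
scores = [1, 2, 3, 4]                # unit-spaced row and column scores

def odds_ratio(beta, u, v, a, c, b, d):
    """Odds ratio (10.6) for rows a, c and columns b, d (0-based indexes)."""
    return math.exp(beta * (u[c] - u[a]) * (v[d] - v[b]))

local_or = odds_ratio(beta_hat, scores, scores, 0, 1, 0, 1)
wald_ci = (math.exp(beta_hat - 1.96 * se), math.exp(beta_hat + 1.96 * se))
corner_or = odds_ratio(beta_hat, scores, scores, 0, 3, 0, 3)

# The same corner odds ratio from the L x L fitted values in Table 10.3.
corner_from_fit = (80.9 * 155.5) / (29.1 * 33.0)

print("local odds ratio:", round(local_or, 2))
print("95% Wald CI:", tuple(round(x, 2) for x in wald_ci))
print("corner odds ratio:", round(corner_or, 1), "or", round(corner_from_fit, 1))
```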

Suppose we regard categories 2 and 3 as farther apart than categories 1 and 2, or
categories 3 and 4. Scores such as {1, 2, 4, 5} for rows and columns recognize this. The
L x L model then has $G^2$ = 8.85 (df = 8) and $\hat{\beta}$ = 0.146 (SE = 0.014). But we need
not regard the scores as approximations for distances between categories or as reasonable
scalings of ordinal variables in order for the models to be valid. They merely imply a
certain pattern for the odds ratios. If the L x L model fits well with equally spaced row and
column scores, the uniform local odds ratio describes the association regardless of whether
the scores are sensible indexes of true distances between categories.

For scores $\{u_i = i\}$ with Table 10.3, the marginal mean and standard deviation for
premarital sex are 2.81 and 1.26. The standardized scores are $\{(i - 2.81)/1.26\}$, or
(-1.44, -0.65, 0.15, 0.95). The standardized equal-interval scores for birth control are
(-1.65, -0.69, 0.27, 1.23). For these scores, $\hat{\beta}$ = 0.374. Solving $\hat{\beta} = \rho/(1 - \rho^2)$ for $\rho$
yields $\hat{\rho}$ = 0.333. If there is an underlying bivariate normal distribution, we estimate the
correlation to be 0.333.
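
The standardized scores and the implied correlation can be retraced from the margins of Table 10.3. In the sketch below, two steps are assumptions made for illustration rather than statements from the text: the standardized-score estimate is obtained by rescaling the unit-score estimate 0.2858 by the two marginal standard deviations (the centering terms are absorbed by the main effects), and $\rho$ is taken as the root of $\beta\rho^2 + \rho - \beta = 0$ that lies in (0, 1).

```python
import math

row_tot = [247, 93, 175, 411]    # premarital sex margins of Table 10.3
col_tot = [159, 192, 324, 251]   # birth control margins of Table 10.3
scores = [1, 2, 3, 4]

def mean_and_sd(values, weights):
    """Mean and standard deviation of scores under the marginal distribution."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum((v - mean) ** 2 * w for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

mean_x, sd_x = mean_and_sd(scores, row_tot)
mean_y, sd_y = mean_and_sd(scores, col_tot)
std_x = [(u - mean_x) / sd_x for u in scores]
std_y = [(v - mean_y) / sd_y for v in scores]

beta_unit = 0.2858                      # estimate for unit-spaced scores (Table 10.4)
beta_std = beta_unit * sd_x * sd_y      # assumption: rescaling for standardized scores

# Solve beta = rho / (1 - rho^2), i.e. beta*rho^2 + rho - beta = 0, for 0 < rho < 1.
rho = (-1 + math.sqrt(1 + 4 * beta_std ** 2)) / (2 * beta_std)

print("premarital sex scores:", [round(s, 2) for s in std_x])
print("birth control scores: ", [round(s, 2) for s in std_y])
print("beta (standardized):", round(beta_std, 3), " rho:", round(rho, 3))
```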

10.4.5  Directed Ordinal Test of Independence

For the linear-by-linear association model, $H_0$: independence is $H_0$: $\beta = 0$. The likelihood-ratio
test statistic equals

$$G^2(I \mid L \times L) = G^2(I) - G^2(L \times L).$$

Designed to detect positive or negative trends, it has df = 1. For Table 10.3, $G^2(I \mid L \times L)$ =
127.65 - 11.53 = 116.1 has P < 0.0001, extremely strong evidence of an association.
The Wald statistic $z^2 = (\hat{\beta}/SE)^2 = (0.286/0.0282)^2 = 102.5$ (df = 1) also shows strong
evidence. The correlation statistic (3.16) presented in Section 3.4.1 for testing independence
is the score statistic for $H_0$: $\beta = 0$ in this model (Exercise 10.27). It equals 112.6 (df = 1).
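
The score statistic can be computed directly from the table, since it needs no model fitting. The sketch below assumes the correlation statistic has the form $M^2 = (n - 1)r^2$, with $r$ the sample correlation between the row and column scores, and uses the identity that the upper tail of a chi-squared variable with df = 1 is `erfc(sqrt(x/2))`.

```python
import math

# Observed counts from Table 10.3, with scores (1, 2, 3, 4) for rows and columns.
counts = [
    [81, 68, 60, 38],
    [24, 26, 29, 14],
    [18, 41, 74, 42],
    [36, 57, 161, 157],
]
scores = [1, 2, 3, 4]
n = sum(sum(row) for row in counts)
pairs = [(scores[i], scores[j], counts[i][j]) for i in range(4) for j in range(4)]

mean_u = sum(u * c for u, _, c in pairs) / n
mean_v = sum(v * c for _, v, c in pairs) / n
cov = sum(u * v * c for u, v, c in pairs) / n - mean_u * mean_v
var_u = sum(u * u * c for u, _, c in pairs) / n - mean_u ** 2
var_v = sum(v * v * c for _, v, c in pairs) / n - mean_v ** 2

r = cov / math.sqrt(var_u * var_v)   # sample correlation between the scores
m2 = (n - 1) * r ** 2                # assumed form of the correlation statistic

lr = 127.65 - 11.53                  # likelihood-ratio statistic from the text

def chi2_df1_pvalue(x):
    """Upper-tail probability of a chi-squared variable with df = 1."""
    return math.erfc(math.sqrt(x / 2))

print("r =", round(r, 3), " M^2 =", round(m2, 1), " P =", chi2_df1_pvalue(m2))
print("LR statistic =", round(lr, 2), " P =", chi2_df1_pvalue(lr))
```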

When the L x L model holds, the ordinal test using $G^2(I \mid L \times L)$ is asymptotically more
powerful than the test using $G^2(I)$. This is because the power of a chi-squared test increases
when df decrease, for fixed noncentrality (Section 5.3.8). When the L x L model holds, the
noncentrality is the same for $G^2(I \mid L \times L)$ and $G^2(I)$; thus $G^2(I \mid L \times L)$ is more powerful,
since its df = 1 compared with (I - 1)(J - 1) for $G^2(I)$. The power advantage increases
as I and J increase, since the noncentrality remains focused on df = 1 for $G^2(I \mid L \times L)$ but
df also increases for $G^2(I)$.

10.4.6  Row  Effects  and  Column  Effects  Association  Models 

Generalizations of the linear-by-linear association model treat some or all scores as
parameters rather than fixed. To illustrate, replacing the ordered row values $\{\beta u_i\}$ in the
linear-by-linear term $\beta u_i v_j$ in model (10.5) by unordered parameters $\{\mu_i\}$ gives

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \mu_i v_j. \tag{10.7}$$

Constraints are needed such as $\lambda_I^X = \lambda_J^Y = \mu_I = 0$. The $\{\mu_i\}$ are called row effects and
the model is called the row effects model. Since the row effects are unordered but $\{v_j\}$
are ordered, this model treats X as nominal and Y as ordinal. The independence model
is the special case $\mu_1 = \cdots = \mu_I$. A corresponding column effects model has association
term $u_i \nu_j$. It treats X as ordinal with ordered scores $\{u_i\}$ and Y as nominal with unordered
parameters $\{\nu_j\}$.

The likelihood equations for the row effects model (10.7) are $\{\hat{\mu}_{i+} = n_{i+}\}$, $\{\hat{\mu}_{+j} = n_{+j}\}$,
and

$$\sum_j v_j \hat{\mu}_{ij} = \sum_j v_j n_{ij}, \quad i = 1, \ldots, I.$$

Let $\hat{\pi}_{j|i} = \hat{\mu}_{ij}/\hat{\mu}_{i+}$ and $p_{j|i} = n_{ij}/n_{i+}$. Since $\hat{\mu}_{i+} = n_{i+}$, the third likelihood equation is
$\sum_j v_j \hat{\pi}_{j|i} = \sum_j v_j p_{j|i}$. For the conditional distribution within each row, the mean column
score is the same for the fitted and sample distributions. The likelihood equations are solved
using iterative methods such as Newton-Raphson.

With $\{v_{j+1} - v_j = 1\}$, the row effects model has adjacent-categories logit form

$$\log \frac{P(Y = j + 1 \mid X = i)}{P(Y = j \mid X = i)} = \alpha_j + \mu_i. \tag{10.8}$$

The effect in row i is identical for each pair of adjacent responses. Differences among $\{\mu_i\}$
compare rows with respect to their conditional distributions on Y. When $\mu_h = \mu_i$, rows h
and i have identical conditional distributions. If $\mu_i > \mu_h$, Y is stochastically higher in row i
than row h (Exercise 10.25). The row effects model also applies when X is ordinal but row
scores for a linear-by-linear association structure are unknown and need to be estimated,
as illustrated in the following example.

10.4.7  Example:  Estimating  Category  Scores  for  Premarital  Sex 

In  modeling  Table  10.3,  we  assigned  equally  spaced  scores  to  both  ordinal  variables.  If  we 
regard  those  scores  as  a  scaling  with  relevant  distances  between  categories,  it  is  not  clear 
that  this  is  sensible  for  categories  such  as  (always  wrong,  almost  always  wrong,  wrong  only 
sometimes, not wrong at all). We next fitted the row effects model, again using column
scores (1, 2, 3, 4). The model fits well, with deviance $G^2 = 7.59$ on df = 6, a decrease of
3.95 on df = 2 compared with the L × L model.

With constraint $\mu_4 = 0$, the other three row effect estimates contrast the first three rows
with those who responded “not wrong at all” on premarital sex. The ML estimates are
$\hat\mu_1 = -0.584$ (SE = 0.059), $\hat\mu_2 = -0.496$ (SE = 0.080), $\hat\mu_3 = -0.203$ (SE = 0.065). So,
the conditional distributions on the column variable differ more between rows 2 and 3 than
between rows 1 and 2 or rows 3 and 4. This explains why the L × L model fitted better
using scores (1, 2, 4, 5) than (1, 2, 3, 4). The further $\hat\mu_i$ falls in the negative direction, the
greater the tendency for those in row i to locate at the “strongly disagree” end of the column
scale, relative to those in row 4. From (10.8) the model predicts constant odds ratios for
adjacent columns of teenage birth control. For instance, the estimated odds that those in
row 4 responded in category j + 1 instead of j were $\exp(\hat\mu_4 - \hat\mu_1) = \exp(0.584) = 1.79$
times the corresponding estimated odds for those in row 1, j = 1, 2, 3.
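
The arithmetic behind this interpretation can be checked with a short sketch. It uses only the
row effect estimates quoted above; the function name `adjacent_odds_multiplier` and the variable
names are illustrative assumptions, not part of any fitting software.

```python
import math

# ML row effect estimates quoted in the text, with mu_4 = 0 as the baseline.
mu_hat = [-0.584, -0.496, -0.203, 0.0]

def adjacent_odds_multiplier(a, b):
    """Factor by which the odds of column j+1 instead of j in row a exceed
    those in row b; under model (10.8) it is the same for every j."""
    return math.exp(mu_hat[a - 1] - mu_hat[b - 1])

ratio_4_vs_1 = adjacent_odds_multiplier(4, 1)

# Gaps between successive row effects: the largest gap marks the pair of
# adjacent rows whose conditional distributions differ most.
gaps = [mu_hat[i + 1] - mu_hat[i] for i in range(len(mu_hat) - 1)]

print(round(ratio_4_vs_1, 2))
print([round(g, 3) for g in gaps])
```

The second printed list shows the largest gap between rows 2 and 3, which is the pattern the text
uses to explain why scores (1, 2, 4, 5) suit the L × L model better than (1, 2, 3, 4).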

10.4.8  Ordinal  Variables  in  Models  for  Multiway  Tables 

Multidimensional tables with ordinal responses can use generalizations of association
models. In three dimensions, the rich collection of models includes association models
that are more parsimonious than the model (XY, XZ, YZ), which treats all variables as
nominal-scale, and models permitting heterogeneous association that, unlike model (XYZ),
are unsaturated.

Models for association that are special cases of (XY, XZ, YZ) replace association terms
by structured terms that account for ordinality. For instance, when both X and Y are ordinal,
alternatives to $\lambda_{ij}^{XY}$ are a linear-by-linear term $\beta u_i v_j$, a row effects term $\mu_i v_j$, or a column
effects term $u_i \nu_j$; these provide a stochastic ordering of conditional distributions within
rows and within columns, or only within rows, or only within columns. With a linear-by-linear
term, the model

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \beta u_i v_j + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} \tag{10.9}$$

has conditional local odds ratios (9.13) that satisfy

$$\log \theta_{ij(k)} = \beta (u_{i+1} - u_i)(v_{j+1} - v_j) \quad \text{for all } k.$$

The association is the same in each partial table, with homogeneous linear-by-linear XY
association.

When the association is heterogeneous, structured terms for ordinal variables make
effects simpler to interpret than in the saturated model. For instance, the heterogeneous
linear-by-linear XY association model adds a term $\beta_k u_i v_j$ to model (10.9), thereby allowing
the XY association to change across levels of Z.

Tests of conditional independence of ordinal classifications can generalize $G^2(I \mid L \times L)$.
For instance, we can compare the XY conditional independence model (XZ, YZ) to the
homogeneous linear-by-linear XY association model (10.9). The comparison tests $\beta = 0$ in that
model, with df = 1. This is an alternative to the ordinal test of conditional independence in Section
8.4.3. Like Mantel’s score statistic (8.19), this statistic uses correlation information, since
$\sum_k \left( \sum_i \sum_j u_i v_j n_{ijk} \right)$ is the sufficient statistic for $\beta$ in model (10.9). In fact, the Mantel
statistic provides a score test of $H_0$: $\beta = 0$ in that model.
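
As a minimal sketch, the sufficient statistic for $\beta$ in model (10.9) is just a score-weighted sum
over all cells of all partial tables. The 2 × 2 × 2 counts below are hypothetical (they are not data
from the text), and the function name is an assumption made for illustration.

```python
def lxl_sufficient_statistic(table, u, v):
    """Return sum over k, i, j of u[i] * v[j] * n[i][j][k],
    for counts stored as table[i][j][k]."""
    return sum(
        u[i] * v[j] * n_ijk
        for i, row in enumerate(table)
        for j, cell in enumerate(row)
        for n_ijk in cell
    )

# Hypothetical counts indexed [i][j][k]; chosen only to show the computation.
counts = [[[10, 8], [5, 7]],
          [[4, 6], [12, 9]]]

stat = lxl_sufficient_statistic(counts, u=[1, 2], v=[1, 2])
print(stat)
```

Because the statistic pools the products of scores across the levels of Z, it carries the same kind
of correlation information within partial tables that the text attributes to the Mantel statistic.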


10.5  GENERALIZED  LOGLINEAR  AND  ASSOCIATION  MODELS, 
CORRELATION  MODELS,  AND  CORRESPONDENCE  ANALYSIS 

We’ve  just  seen  how  to  construct  loglinear  models  that  describe  ordinal  associations.  Other 
generalizations,  some  of  which  do  not  have  loglinear  form,  can  describe  associations  for 
both  ordinal  and  nominal  variables. 

10.5.1  Generalized  Loglinear  Model 

In Section 9.6.2, we expressed loglinear models for cell counts n and expected frequencies
$\mu$ as $\log \mu = X\beta$. A generalized loglinear model that allows many additional models is

$$C \log(A\mu) = X\beta \tag{10.10}$$

for  matrices  C  and  A.  The  ordinary  loglinear  model  results  when  C  and  A  are  identity 
matrices.  Other  special  cases  include  logistic  models  for  binary  or  multicategory  responses 
(Grizzle  et  al.  1969). 

For instance, the loglinear model of independence for a 2 × 2 table is equivalent to a
model by which the logit for Y is the same in each row of X (see Section 9.1.2). That
logit model has form (10.10): A is a 4 × 4 identity matrix, so $A\mu$ is the 4 × 1 vector
$\mu = (\mu_{11}, \mu_{12}, \mu_{21}, \mu_{22})^T$; the product $C \log(A\mu)$ forms the logit in row 1 and the logit in
row 2 using

$$C = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix};$$

then $X = (1, 1)^T$ is a 2 × 1 matrix, and $\beta$ is a single constant $\alpha$, so $X\beta$ forms a common
value for those two logits.
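
The matrix form can be verified numerically with a small sketch. The expected frequencies below
are hypothetical values chosen to satisfy independence, and `mat_vec` is an illustrative helper
rather than a library routine.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# A is the 4 x 4 identity, so A mu is just mu.
A = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]

# C contrasts the two log expected frequencies within each row of the table.
C = [[1, -1, 0, 0],
     [0, 0, 1, -1]]

# Hypothetical (mu11, mu12, mu21, mu22) that satisfy independence.
mu = [10.0, 20.0, 30.0, 60.0]

logits = mat_vec(C, [math.log(x) for x in mat_vec(A, mu)])
print(logits)
```

Both entries of `logits` equal log(1/2), so a single constant $\alpha$ with $X = (1, 1)^T$ reproduces
them, as the model requires.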

The generalized loglinear model (10.10) includes models for association that are not
ordinary loglinear models. For example, for an I × J cross-classification of two ordinal
variables, the odds ratio obtained by collapsing to a 2 × 2 table by combining rows 1 to
i, rows i + 1 to I, columns 1 to j, and columns j + 1 to J, is called a global odds ratio.
A uniform association model specifies a common value for the (I − 1)(J − 1) population
global odds ratios. This is an alternative to the uniform association model for local odds
ratios that results from the linear-by-linear association model with equally spaced scores.
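
The difference between local and global odds ratios can be sketched as follows. The function names
are illustrative assumptions; the counts are those of Table 10.5 in Section 10.5.2, used here only
as a convenient 6 × 4 example with sample rather than population odds ratios.

```python
def local_odds_ratio(n, i, j):
    """Odds ratio for rows i, i+1 and columns j, j+1 (0-based indices)."""
    return (n[i][j] * n[i + 1][j + 1]) / (n[i][j + 1] * n[i + 1][j])

def global_odds_ratio(n, i, j):
    """Odds ratio after collapsing rows into {0..i}, {i+1..} and
    columns into {0..j}, {j+1..} (0-based indices)."""
    def block(rows, cols):
        return sum(n[r][c] for r in rows for c in cols)
    top, bottom = range(0, i + 1), range(i + 1, len(n))
    left, right = range(0, j + 1), range(j + 1, len(n[0]))
    return (block(top, left) * block(bottom, right)) / (
        block(top, right) * block(bottom, left))

# Counts from Table 10.5 (rows: parents' SES A to F; columns: well to impaired).
table_10_5 = [
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
]

# One global odds ratio for each of the (I - 1)(J - 1) cut points.
global_ors = [[global_odds_ratio(table_10_5, i, j) for j in range(3)]
              for i in range(5)]
print(round(global_ors[2][1], 3))
```

For a 2 × 2 table the two definitions coincide, since there is only one way to cut the rows and
one way to cut the columns.
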

In Chapters 11 and 12 we use the generalized loglinear model for models outside the
classes of GLMs studied thus far, such as models for marginal distributions of multivariate
responses. Lang (1996a, 2004, 2005) has developed ML fitting methods and an R function
for generalized loglinear models. His methods apply to an even broader family of models
for contingency tables, called multinomial Poisson homogeneous models, that have the form
$L(\mu) = X\beta$ for a general link function L.

10.5.2  Multiplicative  Row  and  Column  Effects  Model 

The linear-by-linear association (L × L) model (10.5) is a special case of the row effects
(R) model, which has parameter row scores, and the column effects (C) model, which has
parameter column scores. These models are special cases of a more general association
model with row and column parameter scores. Replacing $\{u_i\}$ and $\{v_j\}$ in the L × L model
by parameters yields the row and column effects (RC) model (Goodman 1979a)

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \beta \mu_i \nu_j. \tag{10.11}$$

This model is not loglinear, because the predictor is a multiplicative (rather than linear)
function of parameters $\mu_i$ and $\nu_j$. Identifiability requires location and scale constraints on
$\{\mu_i\}$ and $\{\nu_j\}$. The model treats classifications as nominal; the same fit results from any
permutation of rows or of columns. Parameter interpretation is simplest when at least one
variable is ordinal, through the local log odds ratios

$$\log \theta_{ij} = \beta (\mu_{i+1} - \mu_i)(\nu_{j+1} - \nu_j).$$


Although  it  may  seem  appealing  to  use  parameters  instead  of  arbitrary  scores,  the  RC 
model  presents  complications  that  do  not  occur  with  loglinear  models.  The  likelihood  may 
not  be  concave  and  may  have  local  maxima.  Independence  is  a  special  case,  but  it  is  awkward 
to test independence using the RC model. Haberman (1981) showed that the null distribution
of $G^2(I) - G^2(RC)$ is not chi-squared but rather that of the maximum eigenvalue from a
Wishart matrix. Haberman (1995) provided fitting methods for association models including
nonlinear models such as this one. Software is available.¹

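Goodman (1985) expressed the association term in the saturated model in a form that
generalizes the $\beta \mu_i \nu_j$ term in the RC model, namely,

$$\lambda_{ij}^{XY} = \sum_{k=1}^{M} \beta_k \mu_{ik} \nu_{jk}, \tag{10.12}$$

where M = min(I − 1, J − 1). The parameters satisfy constraints such as

$$\sum_i \mu_{ik} \pi_{i+} = \sum_j \nu_{jk} \pi_{+j} = 0 \quad \text{for all } k,$$

$$\sum_i \mu_{ik}^2 \pi_{i+} = \sum_j \nu_{jk}^2 \pi_{+j} = 1 \quad \text{for all } k, \tag{10.13}$$

$$\sum_i \mu_{ik} \mu_{ih} \pi_{i+} = \sum_j \nu_{jk} \nu_{jh} \pi_{+j} = 0 \quad \text{for all } k \ne h.$$
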
¹ Such as the gnm function in R developed by Turner and Firth (2007).


Table 10.5 Cross-Classification of Mental Health Status and Socioeconomic Status

                                      Mental Health Status
Parents'                          Mild Symptom   Moderate Symptom
Socioeconomic Status     Well     Formation      Formation          Impaired
A (high)                  64          94              58               46
B                         57          94              54               40
C                         57         105              65               60
D                         72         141              77               94
E                         36          97              54               78
F (low)                   21          71              54               71

Source: Reprinted with permission from L. Srole et al. Mental Health in the Metropolis: The Midtown Manhattan
Study. New York: NYU Press, 1978, p. 289.

When $\beta_k = 0$ for $k > M^*$, model (10.12) is called the RC($M^*$) model. The RC model
(10.11) is the case $M^* = 1$.


10.5.3  Example:  Mental  Health  and  Parents’  SES 

Table 10.5 describes the relationship between child’s mental impairment and parents’
socioeconomic status for a sample of residents of Manhattan (Goodman 1979a). The
RC model fits well ($G^2 = 3.57$, df = 8). For scaling (10.13), the ML estimates are
(-1.11, -1.12, -0.37, 0.03, 1.01, 1.82) for the row scores, (-1.68, -0.14, 0.14, 1.41) for
the column scores, and $\hat\beta = 0.17$. Nearly all estimated local log odds ratios are positive,
indicating a tendency for mental health to be better at higher levels of parents’ SES.
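
The claim about the signs of the local log odds ratios can be checked from the reported estimates
alone. This is a sketch that plugs the rounded estimates above into the RC local log odds ratio
formula $\beta(\mu_{i+1} - \mu_i)(\nu_{j+1} - \nu_j)$; the variable names are illustrative.

```python
# Rounded ML estimates reported for the RC model fitted to Table 10.5.
beta_hat = 0.17
row_scores = [-1.11, -1.12, -0.37, 0.03, 1.01, 1.82]
col_scores = [-1.68, -0.14, 0.14, 1.41]

# Estimated local log odds ratio for rows i, i+1 and columns j, j+1.
local_log_or = [
    [beta_hat
     * (row_scores[i + 1] - row_scores[i])
     * (col_scores[j + 1] - col_scores[j])
     for j in range(len(col_scores) - 1)]
    for i in range(len(row_scores) - 1)
]

n_total = sum(len(row) for row in local_log_or)
n_negative = sum(value < 0 for row in local_log_or for value in row)
print(n_negative, "of", n_total, "local log odds ratios are negative")
```

The only negative values come from the first two rows, whose estimated scores are slightly out of
order (-1.11 versus -1.12).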

Ordinal loglinear models also fit well. For equal-interval scores, $G^2(L \times L) = 9.89$
(df = 14). The statistic $G^2(L \times L \mid RC) = 6.32$ (df = 6) tests that row and column
scores in the RC model are equal-interval. The parameter scores do not provide a significantly
better fit. It is sufficient to use a uniform local odds ratio to describe the table. For unit-spaced
scores, $\hat\beta = 0.091$ (SE = 0.015), so the fitted local odds ratio is exp(0.091) = 1.09.
There is strong evidence of positive association, but the degree of association is rather
weak, at least locally.


10.5.4  Correlation  Models 

A correlation model for two-way tables has many features in common with the RC model
(Goodman 1985, 1986). In a one-dimensional version, it is

$$\pi_{ij} = \pi_{i+} \pi_{+j} (1 + \lambda \mu_i \nu_j), \tag{10.14}$$

where $\{\mu_i\}$ and $\{\nu_j\}$ are score parameters satisfying

$$\sum_i \mu_i \pi_{i+} = \sum_j \nu_j \pi_{+j} = 0 \quad \text{and} \quad \sum_i \mu_i^2 \pi_{i+} = \sum_j \nu_j^2 \pi_{+j} = 1.$$

The parameter $\lambda$ is the correlation between the scores for joint distribution (10.14).

The correlation model is also called the canonical correlation model, because ML estimates
of the scores maximize the correlation for (10.14). The general canonical correlation
model is, for M = min(I − 1, J − 1),

$$\pi_{ij} = \pi_{i+} \pi_{+j} \left( 1 + \sum_{k=1}^{M} \lambda_k \mu_{ik} \nu_{jk} \right),$$

where $0 \le \lambda_M \le \cdots \le \lambda_1 \le 1$ and with constraints such as in (10.13). The parameter $\lambda_k$ is
the correlation between $\{\mu_{ik}, i = 1, \ldots, I\}$ and $\{\nu_{jk}, j = 1, \ldots, J\}$. The $\{\mu_{i1}\}$ and $\{\nu_{j1}\}$ are
standardized scores that maximize the correlation $\lambda_1$ for the joint distribution; $\{\mu_{i2}\}$ and
$\{\nu_{j2}\}$ are standardized scores that maximize the correlation $\lambda_2$, subject to $\{\mu_{i1}\}$ and $\{\mu_{i2}\}$
being uncorrelated and $\{\nu_{j1}\}$ and $\{\nu_{j2}\}$ being uncorrelated, and so on.

Unsaturated models result from taking $\lambda_k = 0$ for $k > M^*$ with $M^* < M$. Gilula and
Haberman (1986) and Goodman (1985) discussed ML fitting. When $\lambda$ is close to zero
in (10.14), Goodman (1981a, 1985, 1986) noted that ML estimates of $\lambda$ and the score
parameters are similar to those of $\beta$ and the score parameters in the RC model. Correlation
models can also use fixed scores instead of parameter scores.
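
A minimal sketch of the one-dimensional model (10.14) with fixed scores shows why the parameter
in that model is a correlation. The margins, scores, and value 0.3 below are hypothetical; the
scores are chosen to have mean 0 and variance 1 under the margins, as the constraints require.

```python
def correlation_model(row_margin, col_margin, mu, nu, lam):
    """Joint probabilities pi_ij = pi_i+ * pi_+j * (1 + lam * mu_i * nu_j)."""
    return [[p_i * p_j * (1.0 + lam * m * v)
             for p_j, v in zip(col_margin, nu)]
            for p_i, m in zip(row_margin, mu)]

# Hypothetical margins and standardized scores (mean 0, variance 1).
row_margin = [0.5, 0.5]
col_margin = [0.5, 0.5]
mu = [-1.0, 1.0]
nu = [-1.0, 1.0]
lam = 0.3

joint = correlation_model(row_margin, col_margin, mu, nu, lam)

# With standardized scores, the correlation is the expected product of scores.
total = sum(sum(row) for row in joint)
corr = sum(mu[i] * nu[j] * joint[i][j] for i in range(2) for j in range(2))
print(total, corr)
```

The sketch also makes one of the drawbacks discussed next easy to see: for large score products and
a large correlation parameter, the factor in parentheses can turn negative, so the formula no longer
yields valid probabilities.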

Goodman discussed advantages of association models over correlation models. The
correlation model is not defined for all possible combinations of score values because of
the constraint $0 \le \pi_{ij} \le 1$, ML fitted values do not have the same marginal totals as the
observed data, and the model is not simply generalizable to multiway tables.

10.5.5  Correspondence  Analysis 

Correspondence analysis is a graphical way to represent associations in two-way contingency
tables. The rows and columns are represented by points on a graph, the positions of
which indicate associations. Goodman (1985, 1986) noted that coordinates of the points
are reparameterizations of $\{\mu_{ik}\}$ and $\{\nu_{jk}\}$ in the general canonical correlation model.
Correspondence analysis uses adjusted scores

$$x_{ik} = \lambda_k \mu_{ik}, \qquad y_{jk} = \lambda_k \nu_{jk}.$$

These are close to zero for dimensions k in which the correlation $\lambda_k$ is close to zero. A
correspondence analysis graph uses the first two dimensions, plotting $(x_{i1}, x_{i2})$ for each
row and $(y_{j1}, y_{j2})$ for each column.

Goodman (1985, 1986) used Table 10.5 to illustrate the similarities of correspondence
analysis to analysis using correlation models and association models. For the general
canonical correlation model, M = 3 and $(\hat\lambda_1^2, \hat\lambda_2^2, \hat\lambda_3^2) = (0.0260, 0.0014, 0.0003)$. The
association is rather weak. Table 10.6 contains estimated row and column scores for the
correspondence analysis of these three dimensions. Both sets of scores in the first dimension
fall in a monotone increasing pattern, except for a slight discrepancy between the first two
row scores. This indicates an overall positive association. The scores for the second and
third dimensions are close to zero, reflecting the relatively small $\hat\lambda_2$ and $\hat\lambda_3$.
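
How much each dimension contributes can be summarized by its share of the total squared
correlation. The sketch below assumes that the three values reported above for Table 10.5 are
squared correlations, which is the reading under which the shares match the axis labels of
Figure 10.3.

```python
import math

# Values reported for Table 10.5, taken here to be squared correlations.
squared_correlations = [0.0260, 0.0014, 0.0003]

total = sum(squared_correlations)
shares = [value / total for value in squared_correlations]
percent = [round(100 * share) for share in shares]

# Correlation for the first dimension under the same assumption.
lambda_1 = math.sqrt(squared_correlations[0])

print(percent)
print(round(lambda_1, 3))
```

Under this reading the first dimension carries nearly all of the association, which is why the
points in the correspondence analysis plot hug the horizontal axis.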

Table 10.6 Scores from Correspondence Analysis Applied to Table 10.5

           Column Score Dimension                Row Score Dimension
Column       1        2        3        Row        1        2        3
1          0.260    0.012    0.023      1        0.181   -0.018    0.028
2          0.030    0.024   -0.019      2        0.185   -0.011   -0.026
3         -0.013   -0.069   -0.002      3        0.059   -0.021   -0.010
4         -0.236    0.019    0.016      4       -0.008    0.042    0.011
                                        5       -0.164    0.044   -0.009
                                        6       -0.287   -0.061    0.005

Source: Reprinted with permission from the Institute of Mathematical Statistics, based on Goodman (1985).


Figure 10.3 exhibits the results of the correspondence analysis. The horizontal axis
has estimates for the first dimension, and the vertical axis has estimates for the second
dimension. Six circular points represent the six rows, with point i giving $(\hat x_{i1}, \hat x_{i2})$. Similarly,
four square points display the estimates $(\hat y_{j1}, \hat y_{j2})$. Both sets of points lie close to the
horizontal axis, since the first dimension is more important than the second. Row points
that are close together represent rows with similar conditional distributions across the
columns. Close column points represent columns with similar conditional distributions
across rows.

Correspondence analysis is used mainly as a descriptive tool. Goodman (1986) developed
inferential methods for it. For Table 10.5, inferential analysis reveals that the first dimension,
accounting for 94% of the total squared correlation, is adequate for describing the association.
Goodman argued for choosing the unsaturated model employing only one dimension
and having graphics display fitted scores for that dimension alone. Then, correspondence
analysis is equivalent to a ML analysis using the one-dimensional correlation model (10.14).
The estimated scores for that model are (-1.09, -1.17, -0.37, 0.05, 1.01, 1.80) for the
rows and (-1.60, -0.19, 0.09, 1.48) for the columns. The model fits well ($G^2 = 2.75$,
df = 8).

The  quality  of  fit  and  the  estimated  scores  are  similar  to  those  shown  in  Section  10.5.3  for 
the  RC  model.  More  parsimonious  correlation  models  also  fit  these  data  well,  such  as  ones 
using  equally  spaced  scores.  All  analyses  of  Table  10.5  have  yielded  similar  conclusions 
about  the  association.  They  all  neglect,  however,  that  mental  health  is  a  natural  response 
variable.  It  may  be  more  relevant  to  use  an  ordinal  logistic  model. 


[Figure 10.3: plot of row points (circles, labeled 1 to 6) and column points (squares); horizontal
axis: Dimension 1 (94%); vertical axis: Dimension 2 (5%).]

Figure 10.3 Graphical display of scores from first two dimensions of correspondence analysis. [Based on
Escoufier (1982); reprinted with permission.]

As with correlation models, a severe limitation of correspondence analysis (CA) is that it
does not generalize simply to multiway tables. Greenacre (2007) showed displays of several pairwise
associations in a single plot and discussed a multiple correspondence analysis that applies
CA to a large matrix that contains all the pairwise cross-tabulations for pairs of the set of
variables being analyzed.

10.5.6  Model  Selection  and  Score  Choice  for  Ordinal  Variables 

We  have  presented  several  ways  to  use  category  orderings  in  model  building.  To  choose 
among  loglinear  models,  one  approach  uses  the  standard  models  for  guidance.  If  a  standard 
model  fits  well,  simplify  by  replacing  some  parameters  with  structured  terms  for  ordinal 
classifications. 

Association,  correlation,  and  correspondence  analysis  models  have  scores  for  categories 
of  ordinal  variables.  Parameter  interpretations  are  simplest  for  equally  spaced  scores.  With 
parameter  scores,  the  resulting  ML  estimates  of  scores  need  not  be  monotone.  Constrained 
versions  of  the  models  force  monotonicity  by  maximizing  the  likelihood  subject  to  order 
restrictions  (Agresti  et  al.  1987,  Bartolucci  and  Forcina  2002,  Ritov  and  Gilula  1991). 
Disadvantages  exist,  however,  of  treating  scores  as  parameters.  The  model  becomes  less 
parsimonious,  tests  of  effects  may  be  less  powerful  because  of  a  greater  df  value,  and  those 
tests  may  not  use  standard  distributions. 


10.6  EMPTY  CELLS  AND  SPARSENESS  IN  MODELING 
CONTINGENCY  TABLES 

Sparse contingency tables occur when the sample size n is small or when the number of cells
is large relative to n. Sparseness is common in tables with many variables. It can have an
impact on loglinear model fitting. The following discussion refers to a generic contingency
table and model, with cell counts $\{n_i\}$ and expected frequencies $\{\mu_i\}$ for n observations in
N cells.

10.6.1  Empty  Cells:  Sampling  Versus  Structural  Zeros 

Sparse tables usually contain some cells with $n_i = 0$. These empty cells are of two types:
sampling zeros and structural zeros. In most cases, even though $n_i = 0$, $\mu_i > 0$. It is possible
to have observations in the cell, and $n_i > 0$ with sufficiently large n. Such an empty cell is
called a sampling zero. The empty cells in Table 10.1 for the student survey are sampling
zeros.

An empty cell in which observations are impossible is called a structural zero. For
such cells $\pi_i = 0$ and necessarily $n_i = 0$ and $\mu_i = 0$ regardless of n. For a table that
cross-classifies cancer patients on their gender, race, and type of cancer, some cancers
(e.g., prostate cancer, ovarian cancer) are gender specific. Thus, certain cells have structural
zeros. Contingency tables with structural zeros are called incomplete tables.

Sampling  zeros  are  part  of  the  data  set.  A  count  of  0  is  a  permissible  outcome  for 
a  Poisson  or  a  multinomial  variate.  It  contributes  to  the  likelihood  function  and  model 
fitting.  A  structural  zero,  on  the  other  hand,  is  not  an  observation  and  is  not  part  of  the  data. 
Sampling zeros are much more common than structural zeros, and the remaining discussion
refers to them.

10.6.2 Existence of Estimates in Loglinear Models

Sampling zeros can affect the existence of finite ML estimates of loglinear model parameters.
Haberman (1973b, 1974a), generalizing work by Birch (1963) and Fienberg (1970b),
studied this. For cell counts n with expected values $\mu$, Haberman showed results 1 through
5 for Poisson sampling, and by result 6 they apply also to multinomial sampling.

1. The log-likelihood function is a strictly concave function of $\log \mu$.

2. If a ML estimate of $\mu$ exists, it is unique and satisfies the likelihood equations
$X^T n = X^T \hat\mu$. Conversely, if $\hat\mu$ satisfies the model and also the likelihood equations,
it is the ML estimate of $\mu$.

3. If all $n_i > 0$, ML estimates of loglinear model parameters exist.

4. Suppose that ML parameter estimates exist for a loglinear model that equates observed
and fitted counts in certain marginal tables. Then those marginal tables have uniformly
positive counts.

5. If ML estimates exist for a model M, they also exist for any special case of M.

6. For any loglinear model, the ML estimates $\hat\mu$ are identical for multinomial and
independent Poisson sampling, and those estimates exist in the same situations.

To illustrate, consider the saturated model. By results 2 and 3, when all $n_i > 0$, the ML
estimate of $\mu$ is n. By result 4, parameter estimates do not exist when any $n_i = 0$. Model
parameter estimates are contrasts of $\{\log \hat\mu_i\}$, and since $\hat\mu = n$ for the saturated model, the
estimates are finite only when all $n_i > 0$.

For unsaturated models, by results 3 and 4 ML estimates exist when all $n_i > 0$ and do
not exist when any count is zero in the set of sufficient marginal tables. Suppose that at least
one $n_i = 0$ but the sufficient marginal counts are all positive. For hierarchical loglinear
models, Glonek et al. (1988) showed that the positivity of the sufficient counts implies the
existence of ML estimates if and only if the model is decomposable, which includes the
conditional independence models. Models having all pairs of variables associated, however,
are more complex. For model (XY, XZ, YZ), ML estimates exist when only one $n_i = 0$ but
may not exist when at least two cells are empty. For instance, ML estimates do not exist for
Table 10.7, even though all sufficient statistics (the two-way marginal totals) are positive.
See Exercise 10.36.

Haberman showed that the supremum of the likelihood function is finite. This motivated
him to define extended ML estimators of $\mu$. These always exist but may equal 0 and, falling
on the boundary, need not have the same properties as regular ML estimators. A sequence
of estimates satisfying the model that converges to the extended estimate has log likelihood
approaching its supremum. In this extended sense, $\hat\mu_i = 0$ is the ML estimate of $\mu_i$ for the
saturated model when $n_i = 0$ and some loglinear parameter estimates are infinite. Lauritzen
(1996) gave related results.


Table 10.7 Data for Which ML Estimates Do Not Exist for Model (XY, XZ, YZ)ᵃ

             Z = 1             Z = 2
X        Y = 1   Y = 2     Y = 1   Y = 2
1          0       *         *       *
2          *       *         *       0

ᵃ Cells containing * may contain any positive numbers.
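
The behavior described for Table 10.7 can be illustrated with iterative proportional fitting (IPF),
a standard algorithm for hierarchical loglinear models that is used here only as an illustration.
The sketch assumes a hypothetical version of Table 10.7 with every * cell equal to 5; the function
name and data are not from the text.

```python
def ipf_no_three_factor(n, cycles):
    """Fit model (XY, XZ, YZ) to a 2 x 2 x 2 table n[x][y][z] by IPF,
    starting from fitted values of 1 in every cell."""
    levels = range(2)
    m = [[[1.0 for _ in levels] for _ in levels] for _ in levels]
    for _ in range(cycles):
        # Scale to match the observed XY marginal totals.
        for x in levels:
            for y in levels:
                target = sum(n[x][y][z] for z in levels)
                current = sum(m[x][y][z] for z in levels)
                for z in levels:
                    m[x][y][z] *= target / current
        # Scale to match the observed XZ marginal totals.
        for x in levels:
            for z in levels:
                target = sum(n[x][y][z] for y in levels)
                current = sum(m[x][y][z] for y in levels)
                for y in levels:
                    m[x][y][z] *= target / current
        # Scale to match the observed YZ marginal totals.
        for y in levels:
            for z in levels:
                target = sum(n[x][y][z] for x in levels)
                current = sum(m[x][y][z] for x in levels)
                for x in levels:
                    m[x][y][z] *= target / current
    return m

# Hypothetical instance of Table 10.7: empty cells at (1,1,1) and (2,2,2).
counts = [[[0, 5], [5, 5]],
          [[5, 5], [5, 0]]]

early = ipf_no_three_factor(counts, 1)[0][0][0]
late = ipf_no_three_factor(counts, 200)[0][0][0]
print(early, late)
```

The fitted value in the empty cell stays positive at every finite cycle but keeps shrinking, so the
algorithm drifts toward the boundary instead of settling on a positive solution. This is consistent
with an extended ML estimate of 0 for that cell.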

When a sufficient marginal count for a factor equals zero, infinite estimates occur for that
term. For instance, when an XY marginal total equals zero, infinite estimates occur among
$\{\hat\lambda_{ij}^{XY}\}$ for loglinear models such as (XY, XZ, YZ). Sometimes, however, not even infinite
estimates exist. An example is estimating the log odds ratio when both entries in a column
of a 2 × 2 table equal 0.

A value of $\infty$ (or $-\infty$) for a ML parameter estimate implies that ML fitted values equal
0 in some cells, and some odds ratio estimates equal $\infty$ or 0. One potential indicator is
when  the  iterative  fitting  process  does  not  converge,  typically  because  an  estimate  keeps 
increasing  from  cycle  to  cycle.  Some  software,  however,  is  fooled  after  a  certain  point  in 
the  iterative  process  by  the  nearly  flat  likelihood.  It  reports  convergence,  but  because  of  the 
very  slight  curvature  of  the  log  likelihood,  the  estimated  standard  errors  are  extremely  large 
and  numerically  unstable.  (Similar  behavior  occurs  for  logistic  models,  as  we  discussed  in 
Section  6.5.)  A  danger  with  sparse  data  is  that  you  might  not  realize  that  a  true  estimated 
effect  is  infinite  and,  as  a  consequence,  report  estimated  effects  and  results  of  statistical 
inferences  that  are  invalid. 

Many ML analyses are unharmed by empty cells. Even when a parameter estimate is
infinite, this is not fatal to data analysis. The profile likelihood confidence interval for the
true log odds ratio has one endpoint that is finite. For instance, when $n_{11} = 0$ but other
$n_{ij} > 0$ in a 2 × 2 table, $\log \hat\theta = -\infty$ and a confidence interval has form $(-\infty, U)$ for
some finite upper bound U.

10.6.3  Effects  of  Sparseness  on  X2,  G2,  and  Model-Based  Tests 

Section 3.2.3 discussed the adequacy of chi-squared approximations for tests of independence.
Similar remarks apply more generally. Although empty cells and sparse tables need
not affect model parameter estimates, they can cause sampling distributions of goodness-of-fit
statistics to be far from chi-squared. The true sampling distributions converge to
chi-squared as $n \to \infty$, for a fixed number of cells N. The adequacy of the chi-squared
approximation depends both on n and N.

The size of n/N that produces adequate approximations for $X^2$ tends to decrease as N
increases (Koehler and Larntz 1980). For fixed n and N, the chi-squared approximation
is better for tests with smaller df. For instance, in testing conditional independence in
I × J × K tables, $G^2[(XZ, YZ) \mid (XY, XZ, YZ)]$ [with df = (I − 1)(J − 1)] is closer to
chi-squared than $G^2(XZ, YZ)$ [with df = K(I − 1)(J − 1)]. The ordinal test of $H_0$: $\beta = 0$
with the homogeneous linear-by-linear XY association model (10.9) has df = 1 and behaves
even better.

The model-based statistics $G^2(M_0 \mid M_1)$ and $X^2(M_0 \mid M_1)$ depend on the data only
through the fitted values, and hence only through minimal sufficient statistics for the more
complex model. These statistics have null distributions converging to chi-squared as the
expected values of the minimal sufficient statistics grow. For most loglinear models, these
sufficient statistics refer to marginal tables. Marginal totals are more nearly normally
distributed than are single cell counts. Thus, $G^2(M_0 \mid M_1)$ and $X^2(M_0 \mid M_1)$ converge to
their limiting chi-squared distribution more quickly than do $G^2(M_0)$ and $X^2(M_0)$, which
depend also on individual cell counts. When $\{\mu_i\}$ are small but the sufficient marginal
totals for $M_1$ are mostly in at least the range 5 to 10, the chi-squared approximation is
usually adequate for model comparison statistics. Haberman (1977a) provided theoretical
justification.


10.6.4  Alternative  Sparse  Data  Asymptotics 

When  large-sample  approximations  are  inadequate,  exact  small-sample  methods  are  an 
alternative.  When  they  are  infeasible,  it  is  often  possible  to  approximate  exact  distributions 
precisely  using  Monte  Carlo  methods. 

An alternative approach uses sparse asymptotic approximations that apply when the
number of cells N increases as n increases. For this approach, $\{\mu_i\}$ need not increase, as
they must do in the usual (fixed N, $n \to \infty$) large-sample theory. Chi-squared statistics
then have approximate normal distributions. See Note 10.9.


10.6.5  Adding  Constants  to  Cells  of  a  Contingency  Table 

One way to obtain finite estimates of all effects and ensure convergence of fitting algorithms
is to add a small constant to cell counts. Some algorithms add 1/2 to each cell, as Goodman
(1964b, 1970, 1971a) recommended for saturated models. An example of the beneficial
effect of this for a saturated model is bias reduction for estimating an odds ratio in a 2 × 2
table (Gart and Zweifel 1967). Adding 1/2 to each cell before fitting an unsaturated model
smooths the data too much, however, often causing havoc with sampling distributions. This
operation has too conservative an influence on estimated effects and test statistics. The
effect is very severe with a large number of cells.

Even for a saturated model, adding $\frac{1}{2}$ to each cell is not a panacea for all purposes.
When the ordinary ML estimate of an odds ratio is $\infty$, the estimate after adding $\frac{1}{2}$ to each
cell is finite, as is the upper endpoint of any confidence interval. However, unless you
prefer a Bayesian approach with prior information, it may be more sensible to use an upper
bound of $\infty$, since no sample evidence suggests that the odds ratio falls below any given
value. (Some confidence interval methods that add constants to the data before using Wald
formulas, such as in Exercise 1.25, are intended merely to approximate better intervals such
as score-test-based intervals.)
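The effect of the added constant on a single odds ratio can be seen in a minimal sketch. The
function name `odds_ratio` and the counts below are illustrative assumptions, not taken from the
text; the counts are chosen so that one cell is empty and the ordinary estimate is infinite.

```python
import math

def odds_ratio(n11, n12, n21, n22, c=0.0):
    """Sample odds ratio (n11 * n22) / (n12 * n21) after adding c to every cell."""
    numerator = (n11 + c) * (n22 + c)
    denominator = (n12 + c) * (n21 + c)
    if denominator == 0:
        # An empty cell in the denominator gives an infinite (or undefined) estimate.
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator

# Hypothetical 2 x 2 counts with n21 = 0 (illustration only).
table = (23, 5, 0, 9)

ml_estimate = odds_ratio(*table)       # infinite, because n21 = 0
amended = odds_ratio(*table, c=0.5)    # finite after adding 1/2 to each cell

print(ml_estimate, round(amended, 2))
```

The amended estimate is finite, but as the paragraph above notes, the data themselves place no
upper bound on the odds ratio, so the finite value reflects the added constant rather than
sample evidence.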

When  in  doubt  about  the  effect  of  sparse  data,  perform  a  sensitivity  analysis.  For  example, 
for  each  possibly  influential  observation,  delete  it  or  move  it  to  another  cell  to  see  how 
results  vary  with  small  perturbations  to  the  data.  Influence  diagnostics  for  GLMs  (Williams 
1987)  are  also  useful  for  this  purpose.  Often,  some  associations  are  not  affected  by  empty 
cells  and  give  stable  results  for  the  various  analyses,  whereas  others  that  are  affected  are 
highly  unstable.  Use  caution  in  making  conclusions  about  an  association  if  small  changes 
in  the  data  are  influential. 
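A sensitivity analysis of this kind can be sketched in a few lines. The example below is only an
illustration under stated assumptions: the $2 \times 2$ counts are hypothetical, the statistic of
interest is taken to be the sample log odds ratio, and the helper names `log_odds_ratio` and
`delete_one` are invented for this sketch.

```python
import math

def log_odds_ratio(t):
    """Sample log odds ratio of a 2 x 2 table given as ((n11, n12), (n21, n22))."""
    (a, b), (c, d) = t
    return math.log(a * d) - math.log(b * c)

def delete_one(t):
    """Yield ((row, col), table) with one observation removed from each nonempty cell."""
    for i in range(2):
        for j in range(2):
            if t[i][j] > 0:
                rows = [list(r) for r in t]
                rows[i][j] -= 1
                yield (i, j), tuple(tuple(r) for r in rows)

# Hypothetical counts; every cell is at least 2, so no deletion creates an empty cell.
table = ((10, 3), (2, 8))

base = log_odds_ratio(table)
perturbed = {cell: log_odds_ratio(t) for cell, t in delete_one(table)}
spread = max(perturbed.values()) - min(perturbed.values())

print(round(base, 3), {cell: round(v, 3) for cell, v in perturbed.items()}, round(spread, 3))
```

A large spread relative to the standard error of the estimate would be a signal to interpret
that association cautiously.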

Other  ways  exist  to  smooth  data  in  a  less  ad  hoc  manner  than  adding  arbitrary  constants 
to  cells.  These  include  penalized  likelihood  methods  (Section  7.4.5)  and  Bayesian  methods 
as  discussed  next. 


10.7  BAYESIAN  LOGLINEAR  MODELING 

We’ve  just  seen  that  when  data  are  sparse,  ML  estimates  of  some  model  parameters  may 
be  infinite.  By  contrast,  the  Bayesian  approach  merges  prior  information  with  the  sample 
data, and usually provides shrinkage by which the posterior mean estimate of a parameter
is  finite,  as  are  both  endpoints  of  posterior  intervals. 


10.7.1  Estimating  Loglinear  Model  Parameters  in  Two-Way  Tables 

For  two-way  tables,  Lindley  (1964)  proposed  Bayesian  inference  for  association  parameters, 
using  a  Dirichlet  prior  distribution  and  its  limiting  improper  prior  for  the  multinomial. 
He showed that contrasts of log cell probabilities, such as the log odds ratio, have an
approximate  normal  posterior  distribution.  Using  the  same  structure,  Bloch  and  Watson 
(1967)  provided  improved  approximations  to  the  posterior  distribution  and  also  considered 
linear  combinations  of  the  cell  probabilities. 

A  disadvantage  of  a  Dirichlet  prior  distribution  is  that  it  does  not  allow  for  placing 
structure  on  the  probabilities,  such  as  corresponding  to  a  loglinear  model.  We  could  instead 
put  prior  distributions  on  parameters  of  a  loglinear  model.  Exchangeability  within  each 
set  of  loglinear  parameters  may  be  more  sensible  than  the  exchangeability  of  multinomial 
probabilities  that  we  get  with  a  Dirichlet  prior.  For  the  saturated  model  for  two-way  tables, 
a simple approach treats $\{\lambda_i^X\}$, $\{\lambda_j^Y\}$, and $\{\lambda_{ij}^{XY}\}$ as a priori independent with normal priors.
We  could  recognize  ordinality  by  instead  using  an  association  model. 

Leonard (1975) used a hierarchical approach with the saturated model. For each of
$\{\lambda_i^X\}$, $\{\lambda_j^Y\}$, $\{\lambda_{ij}^{XY}\}$, given a mean $\mu$ and variance $\sigma^2$, the first-stage prior takes them to be
independent and $N(\mu, \sigma^2)$. At the second stage each normal mean is assumed to have an
improper uniform distribution over the real line, and $\sigma^2$ is assumed to have an inverse
chi-squared distribution. Laird (1978), building on this approach, estimated cell probabilities
using an empirical Bayesian approach. This approach replaces $\sigma^2$ by the mode of the
marginal likelihood, after integrating out the loglinear parameters. The analysis shrinks cell
proportion estimates toward the fit of the independence model. As $\sigma \to \infty$, the estimates
converge to the sample proportions; as $\sigma \to 0$, they converge to the independence estimates,
$\{p_{i+} p_{+j}\}$. The fitted values have the same row and column marginal totals as the observed
data. She noted that the use of a symmetric Dirichlet prior results in posterior mean estimates
for cell probabilities that correspond to sample proportions after adding the same count to
each cell, whereas her approach permits considerable variability in the amount added or
subtracted from each cell to get the estimates.
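The remark about the symmetric Dirichlet prior can be illustrated with a short sketch. It uses
the standard conjugate result that a Dirichlet prior with common parameter $a$ and multinomial
counts $\{n_i\}$ in $N$ cells give posterior means $(n_i + a)/(n + Na)$. The function name and the
choice $a = 0.5$ are assumptions made for illustration; the counts are those of Table 10.8 in
Section 10.7.2, read row by row.

```python
def dirichlet_posterior_means(counts, a):
    """Posterior means of multinomial cell probabilities under a symmetric Dirichlet(a) prior."""
    n = sum(counts)
    n_cells = len(counts)
    # Same as the sample proportions after adding the count a to every cell.
    return [(c + a) / (n + n_cells * a) for c in counts]

# Table 10.8 counts: strong Democrats (5 columns), then strong Republicans (5 columns).
counts = [23, 14, 10, 7, 5, 0, 6, 2, 7, 9]

posterior_means = dirichlet_posterior_means(counts, a=0.5)
empty_cell_mean = posterior_means[5]   # the cell with an observed count of 0

print(round(empty_cell_mean, 4), round(sum(posterior_means), 6))
```

Every cell receives the same added count here, which is the feature that Laird's approach
relaxes.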


10.7.2  Example:  Polarized  Opinions  by  Political  Party 

In  recent  years  there  has  been  increasing  polarization  in  the  United  States  between 
Democrats  and  Republicans  on  a  variety  of  issues,  such  as  whether  global  warming  is 
occurring  and  whether  the  government  should  take  steps  to  try  to  slow  it,  whether  abortion 
should  be  legal,  whether  homosexuals  should  have  the  right  to  marry,  whether  health  care 
should  be  provided  for  all  Americans,  whether  George  W.  Bush’s  tax  cuts  for  the  very 
wealthy  should  be  rescinded,  and  even  whether  Barack  Obama  was  born  in  the  United 
States. Contingency tables cross-classifying many such variables can be created from
results of General Social Surveys at sda.berkeley.edu/GSS using such variable names as
PARTYID, GRNTAXES, ABANY, and MARHOMO.

Table 10.8 cross-classifies those who identify as strongly Democratic or strongly
Republican by opinion about homosexual marriage, for the 2010 GSS for respondents under
the age of 40. Of strong Democrats, 39% strongly agreed that homosexuals should have the
right to marry, but 0% of strong Republicans felt this way. In restricting the sample by age
and to those with strong party ID, the sample size is small ($n = 83$). We regard $n_{21} = 0$ as
a sampling zero.

Table 10.8  Political Party Identification and Opinions About Homosexual Marriage

                          Homosexuals Should Have Right to Marry
Party                 Strongly                                  Strongly
Identification        Agree      Agree    Neutral    Disagree   Disagree    Total
Strong Democrat          23        14        10          7          5         59
Strong Republican         0         6         2          7          9         24

Any of the Bayesian strategies just mentioned would yield a positive posterior
probability for the empty cell. Here, we recognize the ordinality of the columns by fitting the
linear-by-linear model (10.5) in its uniform association form with $\{v_j = j\}$. With
uninformative priors, the posterior distribution suggests a strong association. For example, using
independent $N(0, \sigma^2)$ prior distributions for all parameters with $\sigma = 1000$ results in the
uniform log local odds ratio parameter $\beta$ having a posterior mean of 0.849 and a
posterior standard deviation of 0.213. The implied fitted odds ratio for the four corner cells is
$\exp[4(0.849)] \approx 30$.
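The factor of 4 in the exponent can be spelled out in a short derivation. This is a sketch that
assumes model (10.5) has the usual linear-by-linear form with row scores $\{u_i\}$ one unit apart
(for instance $u_1 = 1$, $u_2 = 2$) and column scores $v_j = j$; the symbol $\theta$ for the corner
odds ratio is introduced here only for illustration.

```latex
\begin{align*}
\log \mu_{ij} &= \lambda + \lambda_i^X + \lambda_j^Y + \beta u_i v_j, \\
\log \theta &= \log \frac{\mu_{11}\,\mu_{25}}{\mu_{15}\,\mu_{21}}
             = \beta (u_2 - u_1)(v_5 - v_1) = \beta (1)(4) = 4\beta, \\
\theta &= \exp[4(0.849)] = \exp(3.396) \approx 29.8 .
\end{align*}
```

Each of the four adjacent column pairs contributes one local log odds ratio $\beta$, so the corner
cells accumulate $4\beta$ on the log scale.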

To find the posterior probability of classification in the empty cell (conditional on
being a strong Democrat or strong Republican), we could take the parametric expression
$\mu_{21}/(\sum_i \sum_j \mu_{ij})$ for this probability and integrate it with respect to the joint posterior
distribution of the model parameters. More simply, using standard output, we could evaluate
the probability at the posterior means of all the parameters. This is not identical but gives
similar information. For Table 10.8, this gives a cell fitted value of 1.46, an estimated cell
probability of 0.02 and an estimated conditional probability that strong Republicans make
the strongly agree response of 0.06.
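The two probabilities follow from the fitted value by simple division. The sketch below is an
arithmetic check only; it assumes that the fitted totals are close to the observed totals
(83 overall and 24 strong Republicans), which the text does not state explicitly for the
Bayesian fit, and the variable names are illustrative.

```python
fitted_empty_cell = 1.46   # fitted count reported in the text for the empty cell
n_total = 83               # observed sample size of Table 10.8 (assumed close to the fitted total)
n_republican = 24          # observed strong Republican row total (same assumption)

cell_probability = fitted_empty_cell / n_total
conditional_probability = fitted_empty_cell / n_republican

print(round(cell_probability, 2), round(conditional_probability, 2))
```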

Results using ML are similar, with $\hat{\beta} = 0.813$ (SE = 0.207). The model fits adequately,
with $X^2 = 6.26$ (df = 3). Bayesian analyses for checking model fit are beyond our scope
here.  Spiegelhalter  et  al.  (2002)  presented  a  mean  posterior  deviance  for  checking  fit  and  a 
deviance  information  criterion  for  comparing  models. 
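The ML fit can be reproduced approximately with a small sketch in plain Python. It relies on an
assumption worth stating: for a table with two rows and row scores one unit apart, the
linear-by-linear model implies $\log(\mu_{2j}/\mu_{1j}) = \alpha + \beta v_j$, so $\beta$ can be estimated as
the slope of a logistic regression of party on the column score, fitted here by Newton-Raphson.
The function names are invented for this illustration and are not from the text.

```python
import math

# Table 10.8 counts by column, from "strongly agree" to "strongly disagree".
democrat = [23, 14, 10, 7, 5]
republican = [0, 6, 2, 7, 9]
scores = [1, 2, 3, 4, 5]  # column scores v_j = j

def score_and_information(alpha, beta, y, m, v):
    """Fitted probabilities, score vector, and information for logit(p_j) = alpha + beta * v_j."""
    p = [1.0 / (1.0 + math.exp(-(alpha + beta * vj))) for vj in v]
    s0 = sum(yj - mj * pj for yj, mj, pj in zip(y, m, p))
    s1 = sum(vj * (yj - mj * pj) for yj, mj, pj, vj in zip(y, m, p, v))
    w = [mj * pj * (1.0 - pj) for mj, pj in zip(m, p)]
    i00 = sum(w)
    i01 = sum(wj * vj for wj, vj in zip(w, v))
    i11 = sum(wj * vj * vj for wj, vj in zip(w, v))
    return p, (s0, s1), (i00, i01, i11)

def fit(y, m, v, iterations=25):
    """Newton-Raphson iterations; returns (alpha, beta, SE of beta, fitted probabilities)."""
    alpha = beta = 0.0
    for _ in range(iterations):
        _, (s0, s1), (i00, i01, i11) = score_and_information(alpha, beta, y, m, v)
        det = i00 * i11 - i01 * i01
        # Update by the inverse information matrix times the score vector.
        alpha += (i11 * s0 - i01 * s1) / det
        beta += (i00 * s1 - i01 * s0) / det
    p, _, (i00, i01, i11) = score_and_information(alpha, beta, y, m, v)
    se_beta = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return alpha, beta, se_beta, p

totals = [d + r for d, r in zip(democrat, republican)]
alpha_hat, beta_hat, se_beta, p_hat = fit(republican, totals, scores)

# Fitted counts of the loglinear model and the Pearson statistic over all 10 cells.
fitted_rep = [m * p for m, p in zip(totals, p_hat)]
fitted_dem = [m * (1.0 - p) for m, p in zip(totals, p_hat)]
pearson_x2 = sum((o - e) ** 2 / e
                 for o, e in zip(democrat + republican, fitted_dem + fitted_rep))

print(round(beta_hat, 3), round(se_beta, 3), round(pearson_x2, 2))
```

The output should land close to the ML values quoted above. The residual df of 3 is the 10 cells
minus the 7 parameters of the loglinear form (intercept, one row effect, four column effects,
and $\beta$).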


10.7.3  Bayesian  Loglinear  Modeling  of  Multidimensional  Tables 

Knuiman and Speed (1988) generalized Leonard's loglinear modeling approach by
considering multiway tables and by taking a multivariate normal prior for all parameters
collectively rather than univariate normal priors on individual parameters. This permits
separate specification of prior information for different interaction terms. They applied this
to unsaturated models, computing the posterior mode and using the curvature of the log
posterior at the mode to measure precision. King and Brooks (2001) also specified a
multivariate normal prior on the loglinear parameters, which induces a multivariate log-normal
prior on the expected cell counts. They derived the parameters of this distribution in an
explicit form and stated the corresponding mean and covariances of the cell counts.

We've seen that with ML we can analyze a multinomial loglinear model using a
corresponding Poisson loglinear model, before conditioning on the sample size. Forster (2010)
found corresponding Bayesian results, also using a multivariate normal prior on the model
parameters. He discussed conditions for prior distributions such that marginal inferences are
equivalent  for  Poisson  and  multinomial  models.  These  essentially  allow  the  hyperparameter 
governing  the  overall  size  of  the  cell  means,  which  disappears  after  the  conditioning  that 
yields  the  multinomial  model,  to  have  an  improper  prior.  Forster  also  derived  necessary  and 
sufficient  conditions  for  the  posterior  to  then  be  proper,  and  he  related  them  to  conditions 
for  ML  estimates  to  be  finite. 

Spiegelhalter  and  Smith  (1982)  gave  an  approximate  expression  for  the  Bayes  factor  for 
a  multinomial  loglinear  model  with  an  improper  prior  (uniform  for  the  log  probabilities) 
and  showed  how  it  related  to  the  standard  chi-squared  goodness-of-fit  statistic.  Raftery 
(1986)  noted  that  this  approximation  is  indeterminate  if  any  cell  is  empty  but  is  valid  with  a 
Jeffreys prior. He also noted that, with large samples, $-2$ times the log of this approximate
Bayes  factor  is  approximately  equivalent  to  Schwarz’s  BIC  model  selection  criterion. 


10.7.4  Graphical  Conditional  Independence  Models 

We've seen in Section 10.1 that loglinear conditional independence structure can be
summarized by a graph with vertices for the variables and edges between vertices to represent
a conditional association. The cell probabilities can be expressed in terms of marginal and
conditional probabilities. Independent Dirichlet prior distributions for them induce
independent Dirichlet posterior distributions. O'Hagan and Forster (2004, Chap. 12) showed
the  usefulness  of  graphical  representations  for  a  variety  of  Bayesian  analyses. 

Dawid  and  Lauritzen  (1993)  introduced  the  notion  of  a  probability  distribution  defined 
over  probability  measures  on  a  multivariate  space  that  concentrate  on  a  set  of  such  graphs. 
A  special  case  includes  a  hyper  Dirichlet  distribution  that  is  conjugate  for  multinomial 
sampling  and  that  implies  that  certain  marginal  probabilities  have  a  Dirichlet  distribution. 
Madigan  and  Raftery  (1994)  and  Madigan  and  York  (1995)  used  this  family  for  graphical 
model  comparison  and  for  constructing  posterior  distributions  for  measures  of  interest 
by  averaging  over  relevant  models.  Madigan  and  York  showed  how  Bayesian  graphical 
models  unify  many  standard  discrete  data  problems.  They  proposed  a  Monte  Carlo  method 
for  Bayesian  model  averaging.  A  disadvantage  of  this  approach  is  that  it  applies  only  for 
decomposable  graphical  models. 

Massam  et  al.  (2009)  presented  conjugate  priors  for  the  loglinear  parameters  subject  to 
baseline  constraints  for  multinomial  sampling.  The  induced  prior  on  the  cell  probabilities 
is  a  generalization  of  the  hyper  Dirichlet  prior  to  nondecomposable  graphical  models  as 
well  as  other  hierarchical  loglinear  models. 


NOTES 

Section  10.1:  Conditional  Independence  Graphs  and  Collapsibility 
10.1  Graphical models: For expositions on graphical models and their conditional
independence structure, see Anderson and Bockenholt (2000), Colombi and Giordano (2012),
Dobra (2003), Edwards (2000), Edwards and Kreiner (1983), Gottard et al. (2011), Lauritzen
(1996), Madigan and York (1995), Marchetti and Lupparelli (2010), Ravikumar et al. (2010),
Wermuth and Lauritzen (1983), and Whittaker (1990). Whittaker (1990, Sec. 12.5)
summarized connections with various definitions of collapsibility. Khamis (2011) presented an
alternative representation using multigraphs, in which the vertices represent the cliques of
the model and the edges correspond to variables shared by pairs of cliques. For modeling
social networks using graphs, see Anderson et al. (1999) and references therein.

10.2  Perfect tables: Darroch (1962) defined a three-way table as perfect if for all $i$, $j$, $k$,

$$\sum_i \frac{\pi_{ij+}\,\pi_{i+k}}{\pi_{i++}} = \pi_{+j+}\,\pi_{++k}, \qquad
\sum_k \frac{\pi_{i+k}\,\pi_{+jk}}{\pi_{++k}} = \pi_{i++}\,\pi_{+j+}, \qquad
\sum_j \frac{\pi_{+jk}\,\pi_{ij+}}{\pi_{+j+}} = \pi_{i++}\,\pi_{++k}.$$

For perfect tables, homogeneous association implies that

$$\left\{ \pi_{ijk} = \frac{\pi_{ij+}\,\pi_{i+k}\,\pi_{+jk}}{\pi_{i++}\,\pi_{+j+}\,\pi_{++k}} \right\}$$

and conditional odds ratios are identical to marginal odds ratios. Whittemore (1978) used
perfect tables to illustrate that for $I \times J \times K$ tables with $K > 2$, conditional and marginal
odds ratios can be identical even when no pair of variables is conditionally independent. See
also Davis (1986b).


Section  10.2:  Model  Selection  and  Comparison 

10.3  Model selection: For loglinear model selection, see Aitkin (1979), Benedetti and Brown
(1978), Brown (1976), Dahinden et al. (2010), Goodman (1970, 1971a), and Wermuth
(1976). When a certain model holds, $G^2$/df has an asymptotic mean of 1. Goodman (1971a)
recommended this index for comparing fits. Smaller values represent better fits.

10.4  Partitioning chi-squared: Kullback et al. (1962) and Lancaster (1951) proposed
partitionings of chi-squared statistics in multiway tables. Goodman (1970) and Plackett (1962) noted
difficulties with their approaches. Lang (1996b) discussed partitionings for more complex
models.


Section  10.4:  Modeling  Ordinal  Associations 

10.5  $L \times L$, R, C models: Birch (1965), Goodman (1979a), and Haberman (1974b) introduced
special cases of the linear-by-linear association model. Haberman (1974b) expressed the
$\lambda_{ij}^{XY}$ association term with an expansion in orthogonal polynomials. The row effects and
column effects models were developed by Goodman (1979a), Haberman (1974b), and Simon
(1974). For more general ordinal loglinear models for multiway tables, see Agresti (2010)
and references therein on pp. 180-181, Becker (1989a), Becker and Clogg (1989), and
Goodman (1986).


Section  10.5:  Generalized  Loglinear  and  Association  Models,  Correlation  Models,  and 
Correspondence  Analysis 

10.6  RC models: Early articles on the RC model include Goodman (1979a, 1981a,b) and
Andersen (1980, pp. 210-216), apparently partly motivated by earlier work of G. Rasch (see
Andersen 1995). Anderson and Bockenholt (2000), Becker (1989a,b, 1990), Becker and
Clogg (1989), Choulakian (1988), de Rooij (2008), Goodman (1985, 1986, 1996), and
Wong (2010) discussed generalizations for multiway tables. Xie (1992) adapted it for
comparing mobility tables. Anderson (1984) discussed a related model, called the stereotype
model, as a special case of a baseline-category logit model using parameter response scores.
Anderson and Vermunt (2000) showed that RC and related association models arise when
observed variables are conditionally independent given a latent variable that is conditionally
normal, given the observed variables. Their work generalizes results in Lauritzen and
Wermuth (1989) and discussion by Whittaker of van der Heijden et al. (1989). Anderson and
Yu (2007) used association models for item response modeling. De Rooij and Heiser (2005)
discussed graphical representations for the RC(M) model. Clogg and Shihadeh (1994)
surveyed association models and related correlation models. Wong (2010) and Molenberghs
and Verbeke (2005, Chap. 6) surveyed association models, with the latter using global odds
ratios in many models.

10.7  Correlation  models:  Kendall  and  Stuart  (1979,  Chap.  33)  presented  canonical  correlation 
methods  for  contingency  tables.  See  also  Williams  (1952),  who  discussed  earlier  work  by 
R.  A.  Fisher  and  others.  Karl  Pearson  often  analyzed  tables  by  assuming  an  underlying 
bivariate  normal  distribution  (Section  17.1).  For  estimating  that  distribution's  correlation, 
see  Becker  (1989b),  Goodman  (1981a,b,  1985),  Kendall  and  Stuart  (1979,  Chaps.  26  and 
33), Lancaster (1969, Chap. X), the Pearson (1904) tetrachoric correlation for $2 \times 2$ tables,
and the Lancaster and Hamdan (1964) polychoric correlation for $I \times J$ tables. Gilula (1984)
related the model to latent class models for a two-way table. Gilula and Haberman (1988)
analyzed  multiway  tables  with  correlation  models  by  treating  explanatory  variables  as  a 
single  variable  and  response  variables  as  a  second  variable. 

10.8  Correspondence analysis (CA): CA gained popularity in France under the influence of
Benzecri (1973). Goodman (1996) attributed its origins to H. O. Hartley, publishing under
his German birth name (Hirschfeld, 1935). Greenacre (2007) related it to the singular value
decomposition of a matrix. For other discussion, see Choulakian (1988), Escoufier (1982),
Goodman (1986, 2000), Greenacre (1988), Michailidis and de Leeuw (1998), van der Heijden
and de Leeuw (1985), and van der Heijden et al. (1989). Gabriel (1971) discussed related
work on biplots.


Section  10.6:  Empty  Cells  and  Sparseness  in  Modeling  Contingency  Tables 

10.9  Sparse ML: For Monte Carlo approximation of exact small-sample distributions, see Booth
and Butler (1999), Forster et al. (1996), and Kim and Agresti (1997). For sparse asymptotics
with goodness-of-fit testing of a specified multinomial, Koehler and Larntz (1980) showed
that a standardized version of $G^2$ has an approximate normal distribution. Osius and Rojek
(1992) considered this for $X^2$ and $G^2$ for multinomials with estimated parameters. Koehler
(1986) presented limiting normal distributions for $G^2$ for use in testing models having direct
ML estimates. McCullagh (1986) reviewed ways of handling sparse tables and presented
an alternative approximation for $G^2$. Zelterman (1987) gave normal approximations for
$X^2$ and proposed an alternative statistic. Morris (1975) showed asymptotic normality for a
wide class of functions of multinomial counts including chi-squared statistics, and Cressie
and Read (1984) showed similar results for the power divergence statistic (Exercise 1.34).
Simonoff (1986) proposed a jackknife estimate of the variance of chi-squared statistics under
composite hypotheses with the sparse asymptotic framework. For more on ML estimation
when a sparse table has empty cells, see Eriksson et al. (2006) and Fienberg and Rinaldo
(2011). When a pattern of empty cells forces certain fitted values for a model to equal 0, this
affects the df for testing model fit (Haslett 1990).


Section  10.7:  Bayesian  Loglinear  Modeling 

10.10  Bayesian  loglinear/association  modeling:  Dellaportas  and  Forster  (1999)  and  Ntzoufras 
et  al.  (2000)  developed  MCMC  algorithms  for  choosing  among  many  loglinear  models  for 
a  high-dimensional  table.  Kateri  et  al.  (2005)  provided  a  Bayesian  analysis  of  Goodman’s 
RC(M)  model.  Fienberg  and  Makov  (1998)  applied  Bayesian  loglinear  modeling  to  issues 
of confidentiality, accounting for model uncertainty via Bayesian model averaging.
Agencies often release multidimensional contingency tables that are ostensibly confidential, but
the  confidentiality  can  be  broken  if  an  individual  is  uniquely  identifiable  from  the  data 
presentation.  For  related  work  in  terms  of  ML  estimates,  see  Dobra  et  al.  (2009). 


EXERCISES 

Applications 

10.1  Refer  to  the  models  having  fits  summarized  in  Table  10.2. 

a.  Which  of  these  models  are  graphical  loglinear  models?  Why? 

b.  Explain why the model symbolized by (ACM, ACR, AMG) is a decomposable,
graphical loglinear model. Show that it fits well, with $G^2 = 14.78$ (df = 16).
Interpret the fit. (Thanks to Giovanni Marchetti for pointing this out.)


10.2  Use  Table  9.3  to  illustrate  the  odds  ratio  collapsibility  conditions. 

a.  For model (A, C, M), all conditional odds ratios equal 1.0. Explain why all
reported marginal odds ratios equal 1.0.

b.  For model (AC, M), explain why (i) all conditional odds ratios are the same as
the marginal odds ratios, and (ii) all $\hat{\mu}_{ac+} = n_{ac+}$.

c.  For model (AM, CM), explain why (i) the AC conditional odds ratios of 1.0
need not be the same as the AC marginal odds ratio, and (ii) the AM and
CM conditional odds ratios are the same as the marginal odds ratios and all
$\hat{\mu}_{a+m} = n_{a+m}$ and $\hat{\mu}_{+cm} = n_{+cm}$.

d.  For  model  (AC,  AM,  CM),  explain  why  (i)  no  conditional  odds  ratios  need  be 
the  same  as  the  related  marginal  odds  ratios,  and  (ii)  the  fitted  marginal  odds 
ratios  must  equal  the  sample  marginal  odds  ratios. 


10.3  Refer to the collapsibility condition in Section 10.1.4. For loglinear model
(WX, WY, WZ), what is the impact of collapsing over X on the other associations?
Contrast that with what the conditions suggest, treating group C = {X}, (i) if
A = {Z} and B = {W, Y}, and (ii) if A = {Y, Z} and B = {W}. This shows that
different groupings for that condition can give different information.


10.4  Table 10.9 summarizes a study with variables age of mother (A), length of
gestation (G) in days, infant survival (I), and number of cigarettes smoked per day
during the prenatal period (S). Treat G and I as response variables and A and S as
explanatory.

a.  Explain why a loglinear model should include the $\lambda^{AS}$ term.

b.  Fit the models (AGIS), (AGI, AIS, AGS, GIS), (AG, AI, AS, GI, GS, IS), and (AS,
G, I). Identify a subset of models nested between two of these that may fit well.

c.  Use  (i)  forward  selection  and  (ii)  backward  elimination  to  build  a  model  between 
two  of  the  models  listed  in  (b).  Compare  the  results  of  the  strategies,  and  interpret 
the  models  chosen. 

Table 10.9  Data on Gestation and Infant Survival for Exercise 10.4

                                        Infant Survival
Age       Smoking     Gestation          No        Yes
<30       <5          ≤260               50        315
                      >260               24       4012
          5+          ≤260                9         40
                      >260                6        459
30+       <5          ≤260               41        147
                      >260               14       1594
          5+          ≤260                4         11
                      >260                1        124


Source:  N.  Wermuth,  pp.  279-295  in  Proc.  9th  International  Biometrics  Conference, 
Vol.  1  (1976).  Reprinted  with  permission  from  the  Biometric  Society. 


10.5  Consider loglinear model selection for Table 6.3 on P = premarital sex, E =
extramarital sex, M = marital status, and G = gender.

a.  Why is it not sensible to consider models that omit the $\lambda^{GM}$ term?

b.  Using forward selection starting with (GM, E, P), show that model (GM, GP,
EG, EMP) seems reasonable.

c.  Using  backward  elimination,  show  that  (GM,  GP,  EMP)  or  (GM,  GP,  EG,  EMP) 
seems  reasonable. 

d.  Show  that  the  estimated  EMP  interaction  suggests  that  the  effect  of  extramarital 
sex  on  divorce  is  greater  for  subjects  who  had  no  premarital  sex. 

e.  Use  residuals  to  describe  the  lack  of  fit  of  model  (GM,  EMP). 

10.6  Refer to the model building in Section 10.2.2. Fit model (7) in Table 10.2. Explain
how to interpret the effects of race and gender on these responses.

10.7  For model (AC, AM, CM) with Table 9.3, the standardized residual in each cell
equals ±0.63. Interpret, and explain why each one has the same absolute value.
By contrast, model (AM, CM) has standardized residual ±3.70 in each cell where
M = yes (e.g., ±3.70 when A = C = yes) and ±12.80 in each cell where M = no
(e.g., ±12.80 when A = C = yes). Interpret.

10.8  For  Table  9.8  on  auto  injuries,  conduct  a  residual  analysis  with  the  model  of  no 
three-factor  interaction  to  describe  the  nature  of  the  interaction. 

10.9  Refer  to  the  data  in  Exercise  3.15  on  income  and  job  satisfaction. 

a.  Perform  a  residual  analysis  for  the  independence  model.  Explain  why  it  suggests 
that  the  linear-by-linear  association  model  may  fit  better.  Fit  it,  compare  to  the 
independence  model,  and  interpret. 

b.  Using standardized scores, find and interpret $\hat{\beta}$.

10.10  For  Table  10.8  on  opinions  about  homosexual  marriage,  fit  the  linear-by-linear 
association  model  and  interpret. 

10.11  A weak local association may be substantively important for nonlocal categories.
Illustrate with the $L \times L$ model for Table 10.5 on mental impairment and parents'
SES, showing how the estimated odds ratio for the four corner cells compares to the
estimated local odds ratio.

10.12  Refer  to  Table  10.3  on  birth  control  and  premarital  sex. 

a.  Fit the column effects model, using equally spaced row scores. Compare
estimated column scores to the equal-interval scores for the $L \times L$ model. Test
that  the  true  column  scores  are  equal-interval,  given  that  the  model  holds. 
Interpret. 

b.  The  nature  of  the  row  categories  suggests  a  special  case  of  the  row  effects  model 
with  the  spacing  between  rows  1  and  2  the  same  as  between  rows  3  and  4.  Fit  this 
model  with  equally  spaced  column  scores,  test  its  goodness  of  fit,  and  interpret 
parameter  estimates. 

10.13  Refer  to  the  previous  exercise.  Fit  the  RC  model.  Interpret  the  estimated  scores. 
Does  it  fit  significantly  better  than  the  uniform  association  model? 

10.14  Replicate  the  results  in  Section  10.5.5  for  the  correlation  and  correspondence 
models  with  Table  10.5  on  mental  impairment  and  parents’  SES. 

10.15  Analyze  Table  10.5  on  mental  impairment  using  ordinal  logistic  models.  Interpret, 
and  discuss  advantages/disadvantages  compared  with  using  association  models, 
correlation  models,  and  correspondence  analysis. 

10.16  Download data from the 2010 General Social Survey at sda.berkeley.edu/GSS
cross-classifying opinion about paying higher taxes to help the environment
(variable GRNTAXES) and political party identification (PARTYID). Conduct Bayesian
inference for an association model. Report posterior mean estimates and their
standard deviations for relevant terms describing the associations.

10.17  Conduct  a  Bayesian  loglinear  analysis  for  the  GSS  data  in  Exercise  9.6  on  attitudes 
toward  abortion,  environment,  and  political  party  ID.  Interpret  results,  including 
posterior  intervals  for  conditional  odds  ratios  of  interest. 


Theory  and  Methods 

10.18  Suppose loglinear model (XY, XZ, YZ) holds. Find $\log \mu_{ij+}$ and explain why
marginal associations need not equal conditional associations for this model.

10.19  Consider loglinear model (WX, XY, YZ). Explain why W and Z are independent
given X alone or given Y alone or given both X and Y. When are W and Y
conditionally independent?

10.20  For a four-way table, is the WX conditional association the same as the WX marginal
association for the loglinear model (a) (WX, XYZ)? (b) (WX, WZ, XY, YZ)? Why?

10.21  Loglinear model $M_0$ is a special case of loglinear model $M_1$.

a.  Explain why the fitted values for the two models are identical in the sufficient
marginal distributions for $M_0$.

b.  Haberman (1974a) showed that when $\{\hat{\mu}_i\}$ satisfy any model that is a special
case of $M_0$, $\sum_i \hat{\mu}_{1i} \log \hat{\mu}_i = \sum_i \hat{\mu}_{0i} \log \hat{\mu}_i$. In particular, we can regard $\hat{\mu}_0$ as
the orthogonal projection of $\hat{\mu}_1$ onto the linear manifold of $\{\log \mu\}$ satisfying
$M_0$. Using this, show that $G^2(M_0) - G^2(M_1) = 2 \sum_i \hat{\mu}_{1i} \log(\hat{\mu}_{1i}/\hat{\mu}_{0i})$.

10.22  For a three-way table, show that the fit of mutual independence satisfies

$$G^2[(X, Y, Z)] = G^2[(X, Z)] + G^2[(Y, Z)] + G^2[(XZ, YZ)],$$

where models (X, Z) and (Y, Z) refer to the two-way marginal tables (Cheng et al.
2010).

10.23  For $T$ categorical variables $X_1, \ldots, X_T$, explain why:

a.  $G^2(X_1, X_2, \ldots, X_T) = G^2(X_1, X_2) + G^2(X_1 X_2, X_3) + \cdots
+ G^2(X_1 X_2 \cdots X_{T-1}, X_T)$.

b.  $G^2(X_1 \cdots X_{T-1}, X_T) = G^2(X_1, X_T) + G^2(X_1 X_T, X_1 X_2) + \cdots
+ G^2(X_1 X_2 \cdots X_{T-1}, X_1 X_2 \cdots X_{T-2} X_T)$.

10.24  For $I \times 2$ contingency tables, explain why the linear-by-linear association model
is  equivalent  to  the  linear  logit  model  (5.5)  and  the  column  effects  model,  whereas 
the  row  effects  model  is  equivalent  to  the  saturated  model. 

10.25  Lehmann (1966) defined (X, Y) to be positively likelihood-ratio dependent if their
joint density satisfies $f(x_1, y_1) f(x_2, y_2) \geq f(x_1, y_2) f(x_2, y_1)$ whenever $x_1 < x_2$
and $y_1 < y_2$. Then, the conditional distribution of Y (X) stochastically increases as
X (Y) increases (Goodman 1981a).

a.  For the $L \times L$ model, show that the conditional distributions of Y and of X are
stochastically ordered. What is its nature if $\beta > 0$?

b.  In row effects model (10.7), if $\mu_i > \mu_h$, show that the conditional distribution
of Y is stochastically higher in row $i$ than in row $h$.

10.26  Yule  (1906)  defined  a  table  to  be  isotropic  if  an  ordering  of  rows  and  of  columns 
exists  such  that  the  local  log  odds  ratios  are  all  nonnegative  [see  also  Goodman 
(1981a)]. 

a.  Show  that  a  table  is  isotropic  if  it  satisfies  (i)  the  linear-by-linear  association 
model,  (ii)  the  row  effects  model,  and  (iii)  the  RC  model. 

b.  Explain  why  a  table  that  is  isotropic  for  a  certain  ordering  is  still  isotropic  when 
adjacent  rows  or  columns  are  combined. 

10.27 Consider the log likelihood for the linear-by-linear association model.

a. Differentiating with respect to $\beta$ and evaluating at $\beta = 0$ and null estimates of
parameters, show that the score function is proportional to

$$\sum_i \sum_j u_i v_j (p_{ij} - p_{i+} p_{+j}).$$

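b. Use the delta method to show that this sum has null SE of

$$\left\{ \left[ \sum_i u_i^2 p_{i+} - \Big( \sum_i u_i p_{i+} \Big)^2 \right] \left[ \sum_j v_j^2 p_{+j} - \Big( \sum_j v_j p_{+j} \Big)^2 \right] \Big/ n \right\}^{1/2}.$$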

c.  Construct  a  score  statistic  for  testing  independence.  Show  that  it  is  essentially 
the  correlation  test  (3.16).  [Hirotsu  (1982)  presented  a  family  of  score  tests  for 
ordered  alternatives.] 

10.28 For the row effects model (10.7), show that minimal sufficient statistics are
$\{n_{i+}\}$, $\{n_{+j}\}$, and $\{\sum_j v_j n_{ij},\ i = 1, \ldots, I\}$, and show that the likelihood equations
equate these to their expected values.

10.29  Show  that  the  column  effects  model  corresponds  to  a  baseline-category  logit  model 
for  Y  that  is  linear  in  scores  for  X,  with  slope  depending  on  the  paired  response 
categories. 

10.30 Refer to the homogeneous linear-by-linear XY association model (10.9).

a. Find the likelihood equations and show that they imply that the fitted marginal
XY correlation equals the XY correlation for the sample data.

b. Find the additional likelihood equations that apply for the heterogeneous
linear-by-linear XY association model. Explain why, in each stratum, the fitted XY
correlation equals the sample correlation.

10.31 When model (XY, XZ, YZ) is inadequate and variables are ordinal, useful models
are nested between it and (XYZ). For ordered scores $\{u_i\}$, $\{v_j\}$, and $\{w_k\}$, consider

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \beta u_i v_j w_k.$$

Define $\theta_{ijk} = \theta_{ij(k+1)}/\theta_{ij(k)} = \theta_{i(j+1)k}/\theta_{i(j)k} = \theta_{(i+1)jk}/\theta_{(i)jk}$. For unit-spaced
scores, show there is uniform interaction, $\log \theta_{ijk} = \beta$ (Goodman 1979a). Show
that log odds ratios for any two variables change linearly across levels of the third
variable.

10.32  For  a  3  x  3  table  cross-classifying  two  ordinal  variables,  show  that  the  model 
specifying  a  uniform  global  odds  ratio  is  a  special  case  of  the  generalized  loglinear 
model (10.10), by constructing the C, A, and X matrices. Explain why its residual
df  =  3. 

10.33 Explain why the RC model requires scale constraints for the scores. Show that the
residual df $= (I - 2)(J - 2)$. Find and interpret the likelihood equations. Explain
why the fit is invariant to category orderings.

10.34 For the correlation model (10.14), show (Goodman 1985, 1986):

a. $\lambda$ is the correlation between the scores.

b. $\sum_i \mu_i (\pi_{ij}/\pi_{+j}) = \lambda \nu_j$ and $\sum_j \nu_j (\pi_{ij}/\pi_{i+}) = \lambda \mu_i$. Interpret.

c. With $\lambda$ close to zero, $\log(\pi_{ij})$ has form $\alpha_i + \delta_j + \lambda \mu_i \nu_j + o(\lambda)$, where
$o(\lambda)/\lambda \to 0$ as $\lambda \to 0$. Thus, when the association is weak, the correlation
model is similar to the linear-by-linear association model with $\beta = \lambda$ and scores
$\{u_i = \mu_i\}$ and $\{v_j = \nu_j\}$.

10.35 For the general canonical correlation model with M components, show that

$$\sum_{k=1}^{M} \lambda_k^2 = \sum_i \sum_j (\pi_{ij} - \pi_{i+}\pi_{+j})^2 / \pi_{i+}\pi_{+j}.$$

Thus, the squared correlations partition a dependence measure that is the noncentrality
(6.11) of $X^2$ for the independence model with n = 1. [Goodman (1986)
stated other partitionings.]

10.36 Show that ML estimates do not exist for Table 10.7. [Hint: Haberman (1973b,
1974a, p. 398) noted that if $\hat{\mu}_{111} = c > 0$, then marginal constraints that the model
satisfies imply that $\hat{\mu}_{222} = -c$.]

10.37  For  a  loglinear  model,  explain  heuristically  why  the  ML  estimate  of  a  parameter  is 
infinite  when  its  sufficient  statistic  takes  its  maximum  or  minimum  possible  value, 
for  given  values  of  other  sufficient  statistics. 
Models for Matched Pairs


We  next  introduce  methods  for  comparing  categorical  responses  for  two  samples  when  each 
observation  in  one  sample  pairs  with  an  observation  in  the  other  sample.  Such  matched- 
pairs  data  commonly  occur  in  studies  with  repeated  measurement  of  subjects,  such  as 
longitudinal  studies  that  observe  subjects  over  time  on  the  same  categorical  scale.  Because 
of  the  matching,  the  responses  in  the  two  samples  are  statistically  dependent.  This  is  the 
first  of  four  chapters  on  methods  for  observations  that  are  clustered  in  some  way  so  that  it 
is  not  sensible  to  treat  them  as  independent. 

Table  11.1  illustrates  matched-pairs  data.  In  the  2010  General  Social  Survey,  subjects 
were  asked  who  they  voted  for  in  the  2004  and  2008  Presidential  elections.  Between  these 
elections,  the  overall  voting  population  swung  in  the  Democrat  direction,  going  from  the 
Republican  George  W.  Bush  being  elected  in  2004  to  the  Democrat  Barack  Obama  being 
elected  in  2008.  Was  there  a  shift  in  this  direction  both  for  females  and  for  males,  and  if  so, 
were  the  shifts  of  similar  magnitude?  Table  11.1  shows  results  for  males,  for  those  sampled 
who  voted  Democrat  or  Republican  in  each  election. 

Of  the  433  males  cross-classified  in  this  table,  175  voted  Democrat  in  both  elections, 
188  voted  Republican  in  both,  and  70  changed  parties  with  their  votes.  The  two  cells  with 
identical  row  and  column  response  (the  main  diagonal  of  the  table)  contain  most  of  the 
sample,  since  relatively  few  people  changed  parties.  A  strong  association  exists  between 
the responses, the sample odds ratio being (175 × 188)/(16 × 54) = 38.1.

For  matched  pairs  with  a  categorical  response,  a  two-way  contingency  table  with  the 
same  row  and  column  categories  summarizes  the  data.  The  table  is  square.  In  this  chapter 
we  introduce  methods  for  analyzing  square  tables.  In  Section  11.1  we  describe  methods  for 
comparing proportions with a binary response, and Section 11.2 presents logistic regression
analyses of such data. For multicategory responses in square tables, Section 11.3 presents
nominal and ordinal logistic models for comparing the response distributions, and Section
11.4 introduces loglinear models. In Sections 11.5 and 11.6 we discuss two matched-pairs
applications  for  which  models  for  square  tables  are  useful:  analyzing  agreement  between 
two  observers  who  rate  a  common  set  of  subjects,  and  ranking  treatments  based  on  pairwise 
evaluations. 

Section  11.7  extends  the  models  for  square  tables  that  result  from  matched  pairs  to 
multiway  tables  that  result  from  matched  sets  of  observations.  In  Chapter  12  we  extend 
them  further  to  incorporate  explanatory  variables. 

Categorical  Data  Analysis,  Third  Edition.  Alan  Agresti. 
Table 11.1  Presidential Votes in 2004 and in 2008, for Males
Sampled in 2010 by the General Social Survey

                        2008 Election
2004 Election    Democrat    Republican    Total
Democrat            175          16         191
Republican           54         188         242
Total               229         204         433

11.1  COMPARING  DEPENDENT  PROPORTIONS 

For a subject or matched pair randomly selected from the population of interest, let $\pi_{ab}$
denote the probability of outcome a for the first observation and outcome b for the second.
Let $n_{ab}$ count the number of such pairs in a sample of n matched pairs, with $p_{ab} = n_{ab}/n$ the
sample proportion. We treat $\{n_{ab}\}$ as a sample from a multinomial $(n; \{\pi_{ab}\})$ distribution.
Then, $p_{a+}$ is the proportion in category a for observation 1, and $p_{+a}$ is the corresponding
proportion for observation 2. We compare samples by comparing marginal proportions
$\{p_{a+}\}$ with $\{p_{+a}\}$. With matched samples, these proportions are correlated, and methods
for independent samples are inappropriate.

In this section we consider binary outcomes. When $\pi_{1+} = \pi_{+1}$, then $\pi_{2+} = \pi_{+2}$ also,
and there is marginal homogeneity. Since

$$\pi_{1+} - \pi_{+1} = (\pi_{11} + \pi_{12}) - (\pi_{11} + \pi_{21}) = \pi_{12} - \pi_{21},$$

marginal homogeneity in 2 × 2 tables is equivalent to $\pi_{12} = \pi_{21}$. The table then shows
symmetry across the main diagonal.

11.1.1  Confidence  Intervals  Comparing  Dependent  Proportions 

One comparison of the marginal distributions uses $\delta = \pi_{+1} - \pi_{1+}$. Let

$$d = p_{+1} - p_{1+} = p_{2+} - p_{+2}.$$

From formula (1.3) for multinomial covariances, $\mathrm{cov}(p_{+1}, p_{1+}) = \mathrm{cov}(p_{11} + p_{21},\ p_{11} + p_{12})$
simplifies to $(\pi_{11}\pi_{22} - \pi_{12}\pi_{21})/n$. Thus,

$$\mathrm{var}(\sqrt{n}\, d) = \pi_{1+}(1 - \pi_{1+}) + \pi_{+1}(1 - \pi_{+1}) - 2(\pi_{11}\pi_{22} - \pi_{12}\pi_{21}). \qquad (11.1)$$

For large samples, d has approximately a normal sampling distribution. A Wald confidence
interval for $\delta = \pi_{+1} - \pi_{1+}$ is

$$d \pm z_{\alpha/2}\, \hat{\sigma}(d),$$

where

$$\hat{\sigma}^2(d) = [p_{1+}(1 - p_{1+}) + p_{+1}(1 - p_{+1}) - 2(p_{11}p_{22} - p_{12}p_{21})]/n
= [(p_{12} + p_{21}) - (p_{12} - p_{21})^2]/n, \qquad (11.2)$$

with the second formula following after substitution and some algebra. Inverting the score
test of $H_0$: $\delta = \delta_0$ is more complex but provides an interval having coverage probability
closer to the nominal level (Tango 1998, Tang et al. 2005), as does adding $\tfrac{1}{2}$ to each cell
before computing d and $\hat{\sigma}(d)$ for the Wald interval (Agresti and Min 2005b).
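
As a numerical check of (11.2), the following Python sketch computes d and its Wald standard error from the four cell counts of a 2 × 2 matched-pairs table. The function name `wald_ci_dependent` and the fixed multiplier z = 1.96 are illustrative choices, not part of the text; the counts are those of Table 11.1.

```python
import math

def wald_ci_dependent(n11, n12, n21, n22, z=1.96):
    """Wald interval for pi_{+1} - pi_{1+} from a 2x2 matched-pairs table."""
    n = n11 + n12 + n21 + n22
    p12, p21 = n12 / n, n21 / n
    d = p21 - p12                      # p_{+1} - p_{1+} reduces to p_21 - p_12
    # Second form of (11.2): only the off-diagonal proportions are needed.
    se = math.sqrt(((p12 + p21) - (p12 - p21) ** 2) / n)
    return d, se, (d - z * se, d + z * se)

d, se, (lower, upper) = wald_ci_dependent(175, 16, 54, 188)
print(round(d, 3), round(se, 4), round(lower, 3), round(upper, 3))
```

The diagonal counts enter only through n, which is why they influence the width of the interval but not the sign of d.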

11.1.2  McNemar  Test  Comparing  Dependent  Proportions 

The hypothesis of marginal homogeneity, $H_0$: $\pi_{1+} = \pi_{+1}$, is $H_0$: $\delta = 0$. Under $H_0$, an
estimated variance of d is

$$\hat{\sigma}_0^2(d) = \frac{p_{12} + p_{21}}{n} = \frac{n_{12} + n_{21}}{n^2}. \qquad (11.3)$$

The score test statistic $z_0 = d/\hat{\sigma}_0(d)$ simplifies to

$$z_0 = \frac{n_{21} - n_{12}}{\sqrt{n_{21} + n_{12}}}. \qquad (11.4)$$

The square of $z_0$ is a large-sample chi-squared statistic with df = 1. The test using it is
called McNemar's test (McNemar 1947).

The  McNemar  statistic  depends  only  on  cases  classified  in  different  categories  for  the 
two observations. The $n_{11} + n_{22}$ on the main diagonal are irrelevant to inference about
whether $\pi_{1+}$ and $\pi_{+1}$ differ. However, all cases contribute to inference about how much
$\pi_{1+}$ and $\pi_{+1}$ differ: for instance, to estimating $\delta$ and the standard error. Thus, although
relatively large values of $n_{11}$ and $n_{22}$, for given $n_{12}$ and $n_{21}$, do not give any information
about  whether  there  is  marginal  homogeneity,  they  do  suggest  that  whatever  heterogeneity 
may  exist  is  small.  In  summary,  pairs  with  identical  outcomes  affect  the  estimated  size  of 
the  difference  between  the  marginal  proportions  and  its  standard  error  but  do  not  affect 
statistical  significance  in  terms  of  whether  a  nonzero  difference  truly  exists.  [Agresti  and 
Min  (2003)  discussed  this  issue.] 


11.1.3  Example:  Changes  in  Presidential  Election  Voting 

For Table 11.1, the sample proportions of males voting Democrat were $p_{1+} = 191/433 = 0.441$
in 2004 and $p_{+1} = 229/433 = 0.529$ in 2008. Using (11.2), a 95% confidence interval
for $\pi_{+1} - \pi_{1+}$ is 0.088 ± 1.96(0.0189), or (0.051, 0.125). We infer that the population
percentage of males voting Democrat increased by between about 5% and 13%.

For testing marginal homogeneity, the test statistic (11.4) using the null variance is

$$z_0 = \frac{n_{21} - n_{12}}{\sqrt{n_{21} + n_{12}}} = \frac{54 - 16}{\sqrt{54 + 16}} = 4.54,$$

and the McNemar statistic = 20.63 with df = 1. The two-sided P-value is 0.000006,
extremely strong evidence of a shift in the Democrat direction.
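
The McNemar calculation for Table 11.1 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `mcnemar` is hypothetical, and the two-sided P-value uses the normal approximation through `math.erfc` rather than a chi-squared routine.

```python
import math

def mcnemar(n12, n21):
    """Return (z, chi-squared, two-sided normal P-value) for McNemar's test."""
    z = (n21 - n12) / math.sqrt(n21 + n12)      # statistic (11.4)
    chi2 = z ** 2                               # chi-squared form, df = 1
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * [1 - Phi(|z|)]
    return z, chi2, p_value

z0, chi2, p = mcnemar(n12=16, n21=54)
print(round(z0, 2), round(chi2, 2), p)
```

Only the two off-diagonal counts are passed in, which mirrors the point that pairs with identical outcomes do not affect the test.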

For the corresponding data for females, shown in Exercise 11.2, $p_{1+} = 0.477$ and
$p_{+1} = 0.615$. The 95% confidence interval for $\pi_{+1} - \pi_{1+}$ is 0.138 ± 1.96(0.0167), or
(0.106, 0.171). The shift toward Democrat seems as if it may be greater for females than
males. We can check this with a 95% confidence interval for the difference of differences.
Since the females and males are independent samples, this is

$$(0.138 - 0.088) \pm 1.96\sqrt{(0.0189)^2 + (0.0167)^2}, \quad \text{or } (0.001, 0.100).$$

There is evidence of a greater shift for females than males, as much as 10% greater.

11.1.4  Increased  Precision  with  Dependent  Samples 

The final term of formula (11.1) for $\mathrm{var}(\sqrt{n}\, d)$ is $-2(\pi_{11}\pi_{22} - \pi_{12}\pi_{21})$. Based on
$\mathrm{cov}(p_{+1}, p_{1+})$, this reflects the dependence between the marginal proportions. By contrast,
for independent samples of size n each to estimate binomial probabilities $\pi_1$ and $\pi_2$,
the covariance for the sample proportions is zero, and

$$\mathrm{var}[\sqrt{n}\,(\text{difference of sample proportions})] = \pi_1(1 - \pi_1) + \pi_2(1 - \pi_2).$$

Dependent samples usually exhibit a positive dependence, with $\log \theta = \log[\pi_{11}\pi_{22}/\pi_{12}\pi_{21}] > 0$;
that is, $\pi_{11}\pi_{22} > \pi_{12}\pi_{21}$. From (11.1), positive dependence implies that var(d)
is smaller than when the samples are independent.

A study design using dependent samples can help improve the precision of statistical
inferences for within-subject effects. The improvement is substantial when samples are
highly correlated. To illustrate, Table 11.1 with dependent samples of size 433 each has
a standard error of 0.019 for $d = 0.529 - 0.441$. The two observations have strong
association, the sample odds ratio being 38.1. Independent samples of size 433 each with
$\hat{\pi}_1 - \hat{\pi}_2 = 0.529 - 0.441$ have a standard error of 0.034, nearly twice as large.
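
The comparison of the two standard errors can be verified directly. The sketch below plugs the sample proportions of Table 11.1 into (11.1), once with and once without the covariance term; the variable names are illustrative only.

```python
import math

n = 433
p1_plus, p_plus1 = 191 / n, 229 / n        # marginal proportions, Table 11.1
p11, p12, p21, p22 = 175 / n, 16 / n, 54 / n, 188 / n

# Dependent samples: sample version of (11.1), divided by n to give var(d).
var_dep = (p1_plus * (1 - p1_plus) + p_plus1 * (1 - p_plus1)
           - 2 * (p11 * p22 - p12 * p21)) / n
# Independent samples of size n each: the covariance term is dropped.
var_ind = (p1_plus * (1 - p1_plus) + p_plus1 * (1 - p_plus1)) / n

se_dep, se_ind = math.sqrt(var_dep), math.sqrt(var_ind)
print(round(se_dep, 3), round(se_ind, 3))
```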

11.1.5  Small-Sample  Test  Comparing  Dependent  Proportions 

The null hypothesis of marginal homogeneity for binary matched pairs is, equivalently,
$H_0$: $\pi_{12} = \pi_{21}$ or $\pi_{21}/(\pi_{21} + \pi_{12}) = 0.50$. For small samples, an exact test conditions on
$n^* = n_{21} + n_{12}$. Under $H_0$, $n_{21}$ has a binomial $(n^*, \tfrac{1}{2})$ distribution, for which $E(n_{21}) = \tfrac{1}{2}n^*$.
The P-value for the test uses binomial tail probabilities.

For instance, in analyzing changes in Presidential voting, suppose we focused on those
of age less than 30 at the time of the 2004 election. For the 63 males, the counts are
$n_{11} = 32$, $n_{12} = 4$, $n_{21} = 8$, and $n_{22} = 19$. Then, $n^* = 8 + 4 = 12$ switched party votes,
and the reference distribution is bin(12, $\tfrac{1}{2}$). The two-sided P-value is the probability of at
least 8 successes or at most 4 successes out of 12 trials; that is,

$$P(n_{21} \ge 8) + P(n_{21} \le 4) = 2P(n_{21} \ge 8) = 0.388.$$


When $n^* > 10$, the reference binomial distribution is approximately normal with mean
$\tfrac{1}{2}n^*$ and variance $n^*(\tfrac{1}{2})(\tfrac{1}{2})$. The standardized normal test statistic equals

$$z = \frac{n_{21} - \tfrac{1}{2}n^*}{\sqrt{n^*(\tfrac{1}{2})(\tfrac{1}{2})}} = \frac{n_{21} - n_{12}}{\sqrt{n_{21} + n_{12}}}.$$

This is identical to the standard normal form (11.4) of the McNemar statistic. With $n_{12} = 4$
and $n_{21} = 8$, z = 1.15 has a two-sided P-value of 0.248. The P-value from the large-sample
analysis tends to be closer to the binomial two-sided mid P-value. Here, the mid P-value is
$2[\tfrac{1}{2}P(n_{21} = 8) + P(n_{21} \ge 9)] = 0.267$.
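
The three P-values in this example (exact, mid, and large-sample) can be computed with the standard library alone. The following is a sketch under the assumption that $n_{21} \ge n^*/2$, so that doubling the upper tail gives the two-sided value; `binom_pmf` is a hypothetical helper.

```python
import math

def binom_pmf(k, n, p=0.5):
    """Binomial probability mass at k for n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n12, n21 = 4, 8
n_star = n12 + n21

# P(n21 >= 8) under bin(12, 1/2); the null distribution is symmetric.
upper_tail = sum(binom_pmf(k, n_star) for k in range(n21, n_star + 1))
exact_p = 2 * upper_tail
# Mid P-value: count only half the probability of the observed value.
mid_p = 2 * (upper_tail - 0.5 * binom_pmf(n21, n_star))

z = (n21 - n12) / math.sqrt(n21 + n12)
normal_p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal P-value
print(round(exact_p, 3), round(mid_p, 3), round(z, 2), round(normal_p, 3))
```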

11.1.6  Connection Between McNemar and Cochran-Mantel-Haenszel Tests

An  alternative  representation  of  binary  responses  for  n  matched  pairs  presents  the  data  in 
n  partial  tables,  one  2x2  table  for  each  pair.  It  has  columns  that  are  the  two  possible 
outcomes  for  each  measurement.  Row  1  shows  the  outcome  of  the  first  observation,  and 
row  2  shows  the  outcome  of  the  second. 

Table 11.2 shows the four possible partial tables in this representation. For Table 11.1,
the full three-way table has 433 partial tables; 175 look like the one for subject 1 (i.e.,
Democrat vote in each election), 188 who voted Republican in each election have tables
like  the  one  for  subject  2,  54  have  tables  like  the  one  for  subject  3,  and  16  have  tables 
like  the  one  for  subject  4.  The  433  subjects  from  Table  11.1  provide  866  observations  in 
a  2  x  2  x  433  contingency  table.  Collapsing  this  table  over  the  433  partial  tables  yields  a 
2x2  table  with  first  row  equal  to  (191,  242)  and  second  row  equal  to  (229,  204).  These 
are  the  total  number  of  (Democrat,  Republican)  responses  for  the  two  elections.  They  form 
the  marginal  counts  in  Table  11.1. 

For  each  subject,  suppose  that  the  probability  of  voting  Democrat  is  identical  at  each 
election.  Then,  conditional  independence  exists  between  the  vote  choice  and  the  election 
date,  controlling  for  subject.  The  probability  of  voting  Democrat  is  then  also  the  same 
for  each  election  in  the  marginal  table  collapsed  over  the  subjects.  But  this  implies  that 
the true probabilities for Table 11.1 satisfy marginal homogeneity. Thus, a test of conditional
independence in the 2 × 2 × 433 table provides a test of marginal homogeneity for
Table  11.1. 

Table 11.2  Representation of Four Types of Matched
Pairs Contributing to Counts in Table 11.1

                          Vote Response
Subject    Election    Democrat    Republican
1          2004           1            0
           2008           1            0
2          2004           0            1
           2008           0            1
3          2004           0            1
           2008           1            0
4          2004           1            0
           2008           0            1

To test conditional independence in this three-way table, we can use the
Cochran-Mantel-Haenszel (CMH) statistic (6.6). The result of that chi-squared statistic
is algebraically identical to the chi-squared form of McNemar's statistic, namely,
$(n_{21} - n_{12})^2/(n_{12} + n_{21})$ for tables of the form of Table 11.1. McNemar's test is a special case of
the CMH test applied to the binary responses of n matched pairs displayed in n partial
tables. This connection is not helpful for computational purposes, since the McNemar
statistic is simple. But it does suggest ways of handling more complex matched data. With
several outcome categories or several observations, we can test marginal homogeneity by
applying the generalized CMH tests (Section 8.4) using a single stratum for each subject,
with each row representing a particular observation (Darroch 1981; Mantel and Byar 1978).
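
The algebraic identity between the CMH statistic and McNemar's statistic can be checked numerically. The sketch below builds the 433 subject-specific 2 × 2 tables for Table 11.1 and applies the usual CMH form without a continuity correction (an assumption about the form of statistic (6.6)); the function name and the dictionary layout are illustrative.

```python
def cmh_statistic(tables):
    """CMH chi-squared (no continuity correction) for a list of 2x2 tables."""
    diff_sum, var_sum = 0.0, 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, row2, col1, col2 = a + b, c + d, a + c, b + d
        diff_sum += a - row1 * col1 / n                       # n_11k - E(n_11k)
        var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return diff_sum ** 2 / var_sum

# Rows: (2004, 2008); columns: (Democrat, Republican), as in Table 11.2.
subject_types = {
    ((1, 0), (1, 0)): 175,   # Democrat both times
    ((0, 1), (0, 1)): 188,   # Republican both times
    ((0, 1), (1, 0)): 54,    # Republican in 2004, Democrat in 2008
    ((1, 0), (0, 1)): 16,    # Democrat in 2004, Republican in 2008
}
tables = [t for t, count in subject_types.items() for _ in range(count)]

cmh = cmh_statistic(tables)
mcnemar_chi2 = (54 - 16) ** 2 / (54 + 16)
print(round(cmh, 4), round(mcnemar_chi2, 4))
```

Tables for subjects with the same vote in both elections have a zero column total, so they add nothing to either sum; only the 70 discordant tables contribute.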


11.1.7  Subject-Specific  and  Population-Averaged  (Marginal)  Tables 

We  refer  to  the  2  x  2  x  n  table  representation  of  matched-pairs  data  as  the  subject-specific 
table.  We  refer  to  the  2  x  2  table  of  form  of  Table  11.1  as  the  population-averaged  table, 
since  its  margins  provide  direct  estimates  of  population  marginal  proportions.  We’ll  use 
the  subject-specific  and  population-averaged  (or  marginal)  terminology  in  future  sections 
also  to  refer  to  models  that  apply  to  these  two  data  forms. 


11.2  CONDITIONAL  LOGISTIC  REGRESSION  FOR 
BINARY  MATCHED  PAIRS 

The analyses of Section 11.1 can be expressed in the context of models. Let $(Y_{i1}, Y_{i2})$ denote
the pair of observations for subject (matched-pair) i in the sample, where a “1” outcome
denotes category 1 (success) and “0” denotes category 2. Let $P(Y_t = 1)$ denote the mean
of $P(Y_{it} = 1)$ for all subjects in the population, where we regard $Y_t$ as the response for a
subject randomly selected for observation t. The difference $\delta = P(Y_2 = 1) - P(Y_1 = 1)$
between marginal probabilities occurs as a parameter in the model

$$P(Y_t = 1) = \alpha + \delta x_t, \qquad (11.5)$$

where $x_1 = 0$ and $x_2 = 1$; then, $P(Y_1 = 1) = \alpha$ and $P(Y_2 = 1) = \alpha + \delta$. Alternatively, the
logit link yields

$$\mathrm{logit}[P(Y_t = 1)] = \alpha + \beta x_t. \qquad (11.6)$$

The parameter $\beta$ is a log odds ratio for the marginal distributions.

11.2.1  Subject-Specific  Versus  Marginal  Models  for  Matched  Pairs 

Models (11.5) and (11.6) describe the marginal distributions of responses for the two
observations. They are called marginal models. For instance, in terms of the population-averaged
table, model (11.6) is saturated, and the ML estimate of $\beta$ is the log odds ratio of marginal
proportions, $\hat{\beta} = \log[(p_{+1}/p_{+2})/(p_{1+}/p_{2+})]$. Exercise 11.31 shows its asymptotic
variance.

An alternative modeling approach focuses on the subject-specific tables of the form
shown in Table 11.2. A model for these tables can allow probabilities to vary by subject,
using

$$\mathrm{link}[P(Y_{it} = 1)] = \alpha_i + \beta x_t \qquad (11.7)$$

with subject-specific intercepts. This is called a subject-specific model, since the effect $\beta$
is defined conditional on the subject. Its estimate describes conditional association for the
three-way table stratified by subject. By contrast, the effects in marginal models (11.5) and
(11.6) are population-averaged, since they refer to averaging over the entire population.

The effects in these two types of model can be quite different. To illustrate, for Table 11.1
on Presidential votes in 2004 and in 2008, the ML estimate of the population-averaged effect
$\beta$ in logistic model (11.6) is $\log[(p_{+1}/p_{+2})/(p_{1+}/p_{2+})] = \log[(229/204)/(191/242)] = 0.35$.
By contrast, from a result shown below in (11.10), the estimate of the subject-specific
effect $\beta$ in model (11.7) with logit link is $\log(n_{21}/n_{12}) = \log(54/16) = 1.22$. In Section
13.2.3 we’ll see why these effects can differ so much.
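
The two estimates quoted for Table 11.1 follow from the cell counts alone, as this short Python sketch shows. The variable names are illustrative; the subject-specific value uses the closed form $\log(n_{21}/n_{12})$ cited from (11.10).

```python
import math

# Counts from Table 11.1 (rows: 2004 vote, columns: 2008 vote).
n11, n12, n21, n22 = 175, 16, 54, 188

# Population-averaged effect in (11.6): log odds ratio of the margins.
dem_2004, rep_2004 = n11 + n12, n21 + n22
dem_2008, rep_2008 = n11 + n21, n12 + n22
beta_marginal = math.log((dem_2008 / rep_2008) / (dem_2004 / rep_2004))

# Subject-specific effect in (11.7) with logit link, using result (11.10).
beta_subject = math.log(n21 / n12)
print(round(beta_marginal, 2), round(beta_subject, 2))
```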

For the identity link, subject-specific and population-averaged effects are identical.
For instance, for the subject-specific model (11.7) with identity link, $\beta = P(Y_{i2} = 1) - P(Y_{i1} = 1)$
for all i, and averaging this over subjects in the population equates $\beta$ to the $\delta$
parameter in model (11.5). For nonlinear link functions, however, the effects differ, as we’ll
see next.

11.2.2  Logistic  Models  with  Subject-Specific  Probabilities 

Subject-specific model (11.7) differs from models in earlier chapters by permitting subjects
to have their own probability distributions. Cox (1958b, 1970) and Rasch (1961) presented
this model with logit link. This model for observation t for subject i is

$$\mathrm{logit}[P(Y_{it} = 1)] = \alpha_i + \beta x_t, \qquad (11.8)$$

where $x_1 = 0$ and $x_2 = 1$. That is,

$$P(Y_{i1} = 1) = \frac{\exp(\alpha_i)}{1 + \exp(\alpha_i)}, \qquad P(Y_{i2} = 1) = \frac{\exp(\alpha_i + \beta)}{1 + \exp(\alpha_i + \beta)}.$$

The average of $\exp(\alpha_i + \beta x_t)/[1 + \exp(\alpha_i + \beta x_t)]$ for the population does not have the
form $\exp(\alpha + \beta x_t)/[1 + \exp(\alpha + \beta x_t)]$ corresponding to the marginal logistic model
(11.6).
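
A small numerical illustration may help here. The sketch below assumes a hypothetical population in which half the subjects have $\alpha_i = -4$ and half have $\alpha_i = 4$, with subject-specific effect $\beta = 1$; these values are invented for illustration and do not come from the text. Averaging the subject-specific probabilities and then taking logits gives a marginal effect far smaller than $\beta$.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

beta = 1.0
alphas = [-4.0, 4.0]          # hypothetical two-point population of intercepts

# Population-averaged success probabilities at x_1 = 0 and x_2 = 1.
p1 = sum(expit(a) for a in alphas) / len(alphas)
p2 = sum(expit(a + beta) for a in alphas) / len(alphas)

# Log odds ratio of the averaged probabilities (the marginal effect).
marginal_effect = logit(p2) - logit(p1)
print(round(marginal_effect, 3), beta)
```

With identical intercepts the two effects would coincide; the gap widens as the intercepts spread out.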

Although permitting subject-specific distributions, model (11.8) assumes a common
effect $\beta$. For subject i, the parameter $\beta$ compares the response distributions. For each
subject, the odds of success for observation 2 are $\exp(\beta)$ times the odds for observation 1.

Given the parameters, with model (11.8) the standard approach assumes independence of
responses for different subjects and for the two observations on the same subject. However,
averaged over all subjects, the responses are nonnegatively associated. Suppose that $|\beta|$
is small compared with $|\alpha_i|$. A subject with a large positive $\alpha_i$ has high $P(Y_{it} = 1)$ for
each t and is likely to have a success each time; a subject with a large negative $\alpha_i$ has low
$P(Y_{it} = 1)$ for each t and is likely to have a failure each time. The greater the variability
in $\{\alpha_i\}$, the greater the overall positive association between responses, successes (failures)
for observation 1 tending to occur with successes (failures) for observation 2. This is true
for any $\beta$. The positive association reflects the shared value of $\alpha_i$ for each observation in
a pair. No association occurs only when $\{\alpha_i\}$ are identical. Thus, the model does account
for the dependence in matched pairs. Fitting it takes into account nonnegative association
through the structure of the model.
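
The way heterogeneity in the intercepts induces marginal association can be seen by computing the population-averaged 2 × 2 table directly. This sketch assumes equally likely intercept values chosen purely for illustration; within a subject the two responses are treated as independent, as the model specifies.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def marginal_odds_ratio(alphas, beta):
    """Odds ratio of the (Y1, Y2) table averaged over equally likely alphas."""
    pi = [[0.0, 0.0], [0.0, 0.0]]          # pi[a][b]; index 0 = success, 1 = failure
    for alpha in alphas:
        q1, q2 = expit(alpha), expit(alpha + beta)
        for a, pa in enumerate((q1, 1 - q1)):
            for b, pb in enumerate((q2, 1 - q2)):
                pi[a][b] += pa * pb / len(alphas)   # independence within a subject
    return pi[0][0] * pi[1][1] / (pi[0][1] * pi[1][0])

print(marginal_odds_ratio([0.0, 0.0], beta=0.5))    # identical intercepts
print(marginal_odds_ratio([-1.0, 1.0], beta=0.5))   # moderate heterogeneity
print(marginal_odds_ratio([-3.0, 3.0], beta=0.5))   # strong heterogeneity
```

The odds ratio equals 1 when the intercepts are identical and grows as they spread apart, in line with the verbal argument.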

For this model, the large number of $\{\alpha_i\}$ causes difficulties with the fitting process and
with the properties of ordinary ML estimators (Exercise 11.29). The remedy of conditional
ML treats them as nuisance parameters and maximizes the likelihood function for a
conditional distribution that eliminates them. A note on terminology: Model (11.8) is
sometimes referred to as a conditional model, meaning that its effect $\beta$ is subject-specific,
conditional on the subject. The analyses described below for such models are examples of
conditional logistic regression; but here the term conditional refers to the ML analysis that
is performed conditional on sufficient statistics for nuisance parameters, to eliminate those
parameters from the likelihood. We introduced this approach in Section 7.3.

11.2.3  Conditional  ML  Inference  for  Binary  Matched  Pairs 

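For model (11.8), assuming independence of responses for different subjects and for
the two observations on the same subject, the joint probability mass function for
$\{(y_{11}, y_{12}), \ldots, (y_{n1}, y_{n2})\}$ is

$$\prod_{i=1}^{n} \left[\frac{\exp(\alpha_i)}{1 + \exp(\alpha_i)}\right]^{y_{i1}} \left[\frac{1}{1 + \exp(\alpha_i)}\right]^{1 - y_{i1}} \left[\frac{\exp(\alpha_i + \beta)}{1 + \exp(\alpha_i + \beta)}\right]^{y_{i2}} \left[\frac{1}{1 + \exp(\alpha_i + \beta)}\right]^{1 - y_{i2}}.$$

In terms of the data, this is proportional to

$$\exp\left[\sum_i \alpha_i (y_{i1} + y_{i2}) + \beta \Big(\sum_i y_{i2}\Big)\right].$$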

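To eliminate $\{\alpha_i\}$, we condition on their sufficient statistics, the pairwise success totals
$\{S_i = y_{i1} + y_{i2}\}$. Given $S_i = 0$, $P(Y_{i1} = Y_{i2} = 0) = 1$, and given $S_i = 2$, $P(Y_{i1} = Y_{i2} = 1) = 1$.
The distribution of $(Y_{i1}, Y_{i2})$ depends on $\beta$ only when $S_i = 1$; that is, only when
outcomes differ for the two responses. Given $y_{i1} + y_{i2} = 1$, the conditional distribution is

$$P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1)$$

$$= P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2}) / [P(Y_{i1} = 1, Y_{i2} = 0) + P(Y_{i1} = 0, Y_{i2} = 1)]$$

$$= \frac{\left[\dfrac{\exp(\alpha_i)}{1 + \exp(\alpha_i)}\right]^{y_{i1}} \left[\dfrac{1}{1 + \exp(\alpha_i)}\right]^{1 - y_{i1}} \left[\dfrac{\exp(\alpha_i + \beta)}{1 + \exp(\alpha_i + \beta)}\right]^{y_{i2}} \left[\dfrac{1}{1 + \exp(\alpha_i + \beta)}\right]^{1 - y_{i2}}}{\dfrac{\exp(\alpha_i)}{1 + \exp(\alpha_i)} \cdot \dfrac{1}{1 + \exp(\alpha_i + \beta)} + \dfrac{1}{1 + \exp(\alpha_i)} \cdot \dfrac{\exp(\alpha_i + \beta)}{1 + \exp(\alpha_i + \beta)}}$$

$$= \exp(\beta)/[1 + \exp(\beta)], \quad y_{i1} = 0,\ y_{i2} = 1$$

$$= 1/[1 + \exp(\beta)], \quad y_{i1} = 1,\ y_{i2} = 0.$$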

Again, let $\{n_{ab}\}$ denote the counts for the four possible sequences. For subjects having
$S_i = 1$, $\sum_i y_{i1} = n_{12}$, the number of subjects having success for observation 1 and failure
for observation 2. Similarly, for those subjects, $\sum_i y_{i2} = n_{21}$ and $\sum_i S_i = n^* = n_{12} + n_{21}$.
Since $n_{21}$ is the sum of $n^*$ independent, identical Bernoulli variates, its conditional
distribution is binomial with parameter $\exp(\beta)/[1 + \exp(\beta)]$. For testing marginal homogeneity
($\beta = 0$), the parameter equals $\tfrac{1}{2}$. In summary, the conditional analysis for the logistic model
implies that pairs in which $y_{i1} = y_{i2}$ are irrelevant to inference about $\beta$. When this model
is realistic, it provides justification for comparing marginal distributions using only the
$n_{12} + n_{21}$ pairings having outcomes in different categories at the two observations.

Conditional on $S_i = 1$, the joint distribution of the matched pairs is

$$\prod_{S_i = 1} \left[\frac{1}{1 + \exp(\beta)}\right]^{y_{i1}} \left[\frac{\exp(\beta)}{1 + \exp(\beta)}\right]^{y_{i2}} = \frac{[\exp(\beta)]^{n_{21}}}{[1 + \exp(\beta)]^{n^*}}, \qquad (11.9)$$

where the product refers to all pairs having $S_i = 1$. Differentiating the log of this conditional
likelihood and equating to 0 and solving yields the conditional ML estimator of $\beta$ in model
(11.8). You can check that it and its standard error are

$$\hat{\beta} = \log(n_{21}/n_{12}), \qquad SE = \sqrt{1/n_{21} + 1/n_{12}}. \qquad (11.10)$$
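
As a check on the conditional ML result, the following sketch maximizes the conditional log likelihood $n_{21}\beta - n^* \log[1 + \exp(\beta)]$ by Newton-Raphson and compares the answer with the closed forms $\log(n_{21}/n_{12})$ and $\sqrt{1/n_{21} + 1/n_{12}}$. The function name, starting value, and iteration count are illustrative assumptions; the counts are the discordant pairs of Table 11.1.

```python
import math

def conditional_ml(n12, n21, iterations=25):
    """Maximize n21*beta - n_star*log(1 + exp(beta)) by Newton-Raphson."""
    n_star = n12 + n21
    beta = 0.0
    for _ in range(iterations):
        p = 1.0 / (1.0 + math.exp(-beta))        # exp(beta) / [1 + exp(beta)]
        score = n21 - n_star * p                 # first derivative
        info = n_star * p * (1.0 - p)            # minus the second derivative
        beta += score / info
    p = 1.0 / (1.0 + math.exp(-beta))
    se = 1.0 / math.sqrt(n_star * p * (1.0 - p)) # from the observed information
    return beta, se

beta_hat, se = conditional_ml(n12=16, n21=54)
print(round(beta_hat, 3), round(math.log(54 / 16), 3))
print(round(se, 3), round(math.sqrt(1 / 54 + 1 / 16), 3))
```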


11.2.4  Random  Effects  in  Binary  Matched-Pairs  Model 

An alternative remedy to handling the huge number of nuisance parameters in logistic
model (11.8) treats $\{\alpha_i\}$ as random effects. This regards $\{\alpha_i\}$ as an unobserved random
sample from a probability distribution, usually assumed to be $N(\mu, \sigma^2)$ with unknown $\mu$
and $\sigma$. It eliminates $\{\alpha_i\}$ by averaging with respect to their distribution, yielding a marginal
distribution. The likelihood function then depends on $\beta$ as well as the $N(\mu, \sigma^2)$ parameters.
It has only three parameters and is more manageable. For matched pairs with nonnegative
sample log odds ratio, this approach also yields $\hat{\beta} = \log(n_{21}/n_{12})$ (Neuhaus et al. 1994).
This model is an example of a generalized linear mixed model, containing both random
effects and the fixed effect $\beta$. Its analysis is presented in Chapter 13.

Model (11.8) implies that the true odds ratio for each of the n subject-specific partial
tables equals $\exp(\beta)$. In Section 6.4.5 we presented the Mantel-Haenszel estimate of a
common odds ratio for several 2 × 2 tables. In fact, that estimator applied to subject-specific
tables of the form shown in Table 11.2 is algebraically identical to $n_{21}/n_{12}$ for marginal
tables of the form shown in Table 11.1. (Recall that partial tables with responses in only
one column do not contribute to the CMH test or Mantel-Haenszel estimate.) In summary,
the Mantel-Haenszel estimate, the conditional ML estimate, and (with nonnegative log
odds ratio) the ML estimate for the random effects version of logistic model (11.8) yield
$\exp(\hat{\beta}) = n_{21}/n_{12}$.
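
The Mantel-Haenszel identity can also be confirmed numerically. In the sketch below each subject-specific table places the second observation (2008) in the first row, so that the resulting odds ratio compares observation 2 with observation 1; this orientation is an assumption made for the illustration, as are the function and variable names.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio for 2x2 tables ((a, b), (c, d))."""
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in tables)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in tables)
    return num / den

# Rows: (2008, 2004); columns: (Democrat, Republican).
subject_types = {
    ((1, 0), (1, 0)): 175,   # Democrat both times
    ((0, 1), (0, 1)): 188,   # Republican both times
    ((1, 0), (0, 1)): 54,    # Democrat in 2008, Republican in 2004
    ((0, 1), (1, 0)): 16,    # Republican in 2008, Democrat in 2004
}
tables = [t for t, count in subject_types.items() for _ in range(count)]

mh = mantel_haenszel_odds_ratio(tables)
print(mh, 54 / 16)
```

Tables with both responses in the same column contribute zero to the numerator and the denominator, so only the discordant pairs matter.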


11.2.5  Conditional  Logistic  Regression  for  Matched  Case-Control  Studies 

The two observations $(y_{i1}, y_{i2})$ in a matched pair need not refer to the same subject. For
instance, case-control studies that match a single control with each case yield matched-pairs
data. For a binary response Y, each case (Y = 1) is matched with a control (Y = 0)
according to criteria that could affect the response. Subjects in the matched pairs are
measured on the predictor variable(s) of interest, X, and the XY association is analyzed.

Table 11.3 illustrates. A case-control study of acute myocardial infarction (MI) among
Navajo Indians matched 144 victims of MI according to age and gender with 144 people
free of heart disease. Subjects were asked whether they had ever been diagnosed as having
diabetes (x = 0, no; x = 1, yes). Table 11.3 has the same form as Table 11.1 except that
the levels of X rather than the levels of Y form the rows and the columns.

We can display the data for each matched case-control pair using a partial table of the
form shown in Table 11.2, but reversing the roles of X and Y. The X values have four
Table 11.3  Previous Diagnoses of Diabetes for Myocardial Infarction (MI) Case-Control Pairs

                          MI Cases
MI Controls      Diabetes   No Diabetes   Total
Diabetes             9          16          25
No diabetes         37          82         119
Total               46          98         144

Source: J. Coulehan et al., Am. J. Public Health 76: 412-414, 1986.
Reprinted with permission from the American Public Health Association.


Table 11.4  Possible Case-Control Pairs for Table 11.3

                  a                b                c                d
Diabetes    Case  Control    Case  Control    Case  Control    Case  Control
Yes           1      0         0      1         1      1         0      0
No            0      1         1      0         0      0         1      1

possible patterns, shown in Table 11.4. There are 37 partial tables of type a, since for 37
pairs the case had diabetes and the control did not, 16 partial tables of type b, 9 of type c,
and 82 of type d.

Now, for subject t in matched pair i, consider the model

$$\text{logit}[P(Y_{it} = 1)] = \alpha_i + \beta x_{it}. \qquad (11.11)$$

The probabilities modeled refer to the distribution of Y given X, but the retrospective study
provides information only about the distribution of X given Y. We can estimate the odds ratio
$\exp(\beta)$, however, because it refers to the XY odds ratio, which relates to both conditional
distributions (Sections 2.2.4 and 5.1.4). Even though this study reverses the roles of X and
Y in terms of which is fixed and which is random, the conditional ML estimate of $\exp(\beta)$
is $n_{21}/n_{12} = 37/16 = 2.31$.
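
The arithmetic behind this estimate can be sketched in a few lines of Python. This is only an illustration: the variable names are hypothetical, the standard error is the usual large-sample formula for the log ratio of the two discordant counts, and the Wald interval is an added assumption rather than something reported for these data.

```python
import math

# Discordant-pair counts from Table 11.3 (rows = controls, columns = cases).
n21 = 37  # case had diabetes, control did not
n12 = 16  # control had diabetes, case did not

odds_ratio = n21 / n12                   # conditional ML estimate of exp(beta)
beta_hat = math.log(odds_ratio)          # conditional ML estimate of beta
se_beta = math.sqrt(1 / n21 + 1 / n12)   # large-sample SE of beta_hat

# Assumed Wald 95% interval for exp(beta), obtained on the log scale.
z = 1.96
ci = (math.exp(beta_hat - z * se_beta), math.exp(beta_hat + z * se_beta))

print(f"odds ratio = {odds_ratio:.2f}, beta_hat = {beta_hat:.3f}, SE = {se_beta:.3f}")
print(f"95% CI for odds ratio: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Only the 37 + 16 = 53 discordant pairs enter the calculation; the 91 concordant pairs carry no information about the conditional estimate.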


11.2.6  Conditional  Logistic  Regression  for  Matched  Pairs  with  Multiple  Predictors 

When the binary response has p predictors for case-control or subject-specific matched
pairs, the model generalizes to

$$\text{logit}[P(Y_{it} = 1)] = \alpha_i + \beta_1 x_{1it} + \beta_2 x_{2it} + \cdots + \beta_p x_{pit}, \qquad (11.12)$$

where $x_{hit}$ denotes the value of predictor h for observation t in pair i, t = 1, 2. Typically,
one predictor is an explanatory variable of interest, such as diabetes status. The others are
covariates used to adjust effects, in addition to those already controlled by virtue of using
them to form the matched pairs. The conditional ML approach to estimating $\{\beta_j\}$ conditions
on sufficient statistics for $\{\alpha_i\}$ to eliminate them from the likelihood.
Let $x_{it} = (x_{1it}, \ldots, x_{pit})^T$ and $\beta = (\beta_1, \ldots, \beta_p)^T$. A generalization of the derivation in
Section 11.2.3 shows that

$$P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1) = \exp(x_{i2}^T\beta)/[\exp(x_{i1}^T\beta) + \exp(x_{i2}^T\beta)],$$

$$P(Y_{i1} = 1, Y_{i2} = 0 \mid S_i = 1) = \exp(x_{i1}^T\beta)/[\exp(x_{i1}^T\beta) + \exp(x_{i2}^T\beta)]. \qquad (11.13)$$

Dividing numerator and denominator by $\exp(x_{i1}^T\beta)$ shows that the first equation has the
form of logistic regression with no intercept and with predictor values $x_i^* = x_{i2} - x_{i1}$. In
fact, to obtain conditional ML estimates for model (11.12), we can fit a logistic regression
model to these "differing outcome" pairs alone, using artificial response $y_i^* = 1$ when
$(y_{i1} = 0, y_{i2} = 1)$, $y_i^* = 0$ when $(y_{i1} = 1, y_{i2} = 0)$, no intercept, and predictor values $x_i^*$.
This addresses the same likelihood as the conditional likelihood (Breslow et al. 1978,
Chamberlain 1980).
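
The division step can be written out explicitly. The display below is a short worked version of that one algebraic step, using only the symbols already defined for (11.13):

```latex
P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1)
  = \frac{\exp(x_{i2}^T\beta)}{\exp(x_{i1}^T\beta) + \exp(x_{i2}^T\beta)}
  = \frac{\exp[(x_{i2} - x_{i1})^T\beta]}{1 + \exp[(x_{i2} - x_{i1})^T\beta]}
  = \frac{\exp(x_i^{*T}\beta)}{1 + \exp(x_i^{*T}\beta)},
\qquad\text{so}\qquad
\text{logit}\,[P(y_i^* = 1)] = x_i^{*T}\beta .
```

The pair-specific intercept has already dropped out, which is why the equivalent logistic regression has no intercept term.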

To illustrate, for model (11.11) with Table 11.3, let $y_i^* = y_{i2} - y_{i1}$ and $x_i^* = x_{i2} - x_{i1}$. If
t = 1 refers to the control and t = 2 to the case, then $y_i^* = 1$ always. Since $x_{it} = 1$ represents
"yes" for diabetes and $x_{it} = 0$ represents "no," $(y_i^* = 1, x_i^* = -1)$ for 16 observations,
$(y_i^* = 1, x_i^* = 0)$ for 9 + 82 = 91 observations, and $(y_i^* = 1, x_i^* = +1)$ for 37 observations.
The logistic model that forces $\alpha = 0$ has $\hat{\beta} = 0.84$. With a single binary predictor, the
estimate is identical to $\log(n_{21}/n_{12})$.
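
The no-intercept fit just described can be reproduced with a small Newton-Raphson loop. The sketch below is a minimal illustration in plain Python, not a substitute for logistic regression software; the function name `fit_no_intercept_logit` and the grouped data layout are hypothetical choices made for this example.

```python
import math

# Differenced data for Table 11.3 as (x*, number of pairs); every pair has y* = 1.
data = [(-1, 16), (0, 9 + 82), (1, 37)]

def fit_no_intercept_logit(data, iters=25):
    """Newton-Raphson for logit P(y* = 1) = beta * x*, when all y* equal 1."""
    beta = 0.0
    for _ in range(iters):
        score = 0.0
        info = 0.0
        for x, count in data:
            p = 1.0 / (1.0 + math.exp(-beta * x))   # fitted P(y* = 1)
            score += count * x * (1.0 - p)          # derivative of log likelihood
            info += count * x * x * p * (1.0 - p)   # minus the second derivative
        beta += score / info
    return beta

beta_hat = fit_no_intercept_logit(data)
print(round(beta_hat, 2))             # 0.84
print(round(math.log(37 / 16), 2))    # 0.84
```

The 91 pairs with x* = 0 contribute nothing to the score or the information, so the fit depends only on the 16 and 37 discordant pairs.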


11.2.7  Marginal  Models  and  Subject-Specific  Models:  Extensions 

For binary matched-pairs data, Section 11.1 presented analyses for a marginal (i.e.,
population-averaged) model, and this section presented analyses for a subject-specific
model. These models generalize to multinomial responses and to matched sets. For multinomial
responses, Chamberlain (1980) proposed conditional ML for matched pairs. For
binary responses with a matched set, model (11.12) applies when $\alpha_i$ refers to the set. The
matched set might refer to repeated measurements on subject i, or it could refer to a cluster
of subjects, such as children from family i or fetuses from litter i.

With extensions of the subject-specific model to matched-set clusters, the conditional
ML approach is restricted to estimating $\beta_j$ that are within-cluster effects, such as occur in
case-control and crossover studies. For these, the explanatory variable varies in t for each i.
Conditional ML cannot estimate a between-cluster effect. Statistics providing information
about such an effect use subject totals at different levels of the relevant explanatory variable;
however, those totals sum the sufficient statistics for $\{\alpha_i\}$, so they are themselves fixed and
have degenerate distributions after conditioning on the sufficient statistics. An explanatory
variable that is constant in t for each i cancels out of the conditional likelihood. [You can
observe this for matched pairs with (11.13) for any j for which $x_{ji1} = x_{ji2}$ for all i.] For it, at
best one can stratify by its levels and fit a model estimating within-cluster effects separately
at each level. With subject-specific models, an advantage of using the random effects
approach instead of conditional ML is that it is not restricted to estimating within-cluster
effects.

In the remainder of this chapter, we emphasize marginal models for matched pairs
with multinomial responses. In Chapter 12 we deal with marginal model
extensions allowing matched sets and explanatory variables. Subject-specific models using
a random effects approach have extra computational complexities. We mention briefly
some multinomial subject-specific models in this chapter, but we defer most discussion to
Chapter 13.

11.3  MARGINAL  MODELS  FOR  SQUARE  CONTINGENCY  TABLES 

Matched-pairs analyses generalize from binary to I > 2 outcome categories. A square
I x I table $\{n_{ab}\}$ shows counts of possible sequences (a, b) of outcomes for $(Y_{i1}, Y_{i2})$. Let
$\pi_{ab} = P(Y_1 = a, Y_2 = b)$ for a randomly selected subject (i.e., the mean of corresponding
subject-specific probabilities in the population of interest). Marginal homogeneity is
$\pi_{a+} = \pi_{+a}$ for a = 1, ..., I. Marginal models compare $\{\pi_{a+}\}$ and $\{\pi_{+a}\}$.

11.3.1  Marginal  Models  for  Nominal  Classifications 

For nominal-scale matched-pair responses, marginal model (11.6) for binary matched pairs
extends to the baseline-category logit model

$$\log[P(Y_t = j)/P(Y_t = I)] = \alpha_j + \beta_j x_t, \quad t = 1, 2, \quad j = 1, \ldots, I - 1, \qquad (11.14)$$

where $x_1 = 0$ and $x_2 = 1$. This model has 2(I - 1) parameters for the 2(I - 1) marginal
probabilities. It is saturated.

Marginal homogeneity is the special case $\beta_1 = \cdots = \beta_{I-1} = 0$. To fit it, Lipsitz et al.
(1990) and Madansky (1963) maximized the multinomial likelihood for $\{n_{ab}\}$ subject to
these constraints. Iterative methods produce fitted values $\{\hat{\mu}_{ab}\}$. Comparing these to $\{n_{ab}\}$
using $G^2$ or $X^2$ tests marginal homogeneity, with df = I - 1.

Bhapkar (1966) tested marginal homogeneity by exploiting the asymptotic normality of
marginal proportions. Let $d_a = p_{+a} - p_{a+}$, and let $d^T = (d_1, \ldots, d_{I-1})$. It is redundant to
include $d_I$, since $\sum_a d_a = 0$. The sample covariance matrix $\hat{V}$ of $\sqrt{n}\,d$ has elements

$$\hat{V}_{ab} = -(p_{ab} + p_{ba}) - (p_{+a} - p_{a+})(p_{+b} - p_{b+}) \quad \text{for } a \neq b,$$

$$\hat{V}_{aa} = p_{+a} + p_{a+} - 2p_{aa} - (p_{+a} - p_{a+})^2.$$

Now $\sqrt{n}[d - E(d)]$ has an asymptotic multivariate normal distribution with estimated
covariance matrix $\hat{V}$. Under marginal homogeneity, E(d) = 0, and

$$W = n\,d^T \hat{V}^{-1} d \qquad (11.15)$$

is asymptotically chi-squared with df = I - 1. This is a Wald test for parameters in the
analog of model (11.14) using the identity link. Stuart (1955) proposed $W_0 = n\,d^T \hat{V}_0^{-1} d$,
which uses the sample null covariance matrix $\hat{V}_0$ and is the score test. This has

$$\hat{V}_{ab0} = -(p_{ab} + p_{ba}) \quad \text{for } a \neq b,$$

$$\hat{V}_{aa0} = p_{+a} + p_{a+} - 2p_{aa}.$$

Ireland et al. (1969) noted that $W = W_0/(1 - W_0/n)$. For I = 2, $W_0$ is McNemar's statistic,
the square of (11.4).
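
The two statistics can be computed directly from the covariance formulas above. The following is a rough sketch in plain Python, assuming nothing beyond those formulas; `solve` and `marginal_homogeneity_tests` are hypothetical helper names, and the small Gaussian-elimination routine stands in for a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def marginal_homogeneity_tests(table):
    """Return (W0, W): Stuart's score statistic and Bhapkar's Wald statistic."""
    I = len(table)
    n = sum(map(sum, table))
    p = [[cell / n for cell in row] for row in table]
    row = [sum(p[a]) for a in range(I)]                        # p_{a+}
    col = [sum(p[a][b] for a in range(I)) for b in range(I)]   # p_{+a}
    d = [col[a] - row[a] for a in range(I - 1)]                # d_I is redundant
    V0 = [[(col[a] + row[a] - 2 * p[a][a]) if a == b else -(p[a][b] + p[b][a])
           for b in range(I - 1)] for a in range(I - 1)]
    V = [[V0[a][b] - d[a] * d[b] for b in range(I - 1)] for a in range(I - 1)]
    W0 = n * sum(di * xi for di, xi in zip(d, solve(V0, d)))
    W = n * sum(di * xi for di, xi in zip(d, solve(V, d)))
    return W0, W

# Table 11.5 counts: rows = region at age 16, columns = region in 2010.
table_11_5 = [
    [266, 15, 61, 28],
    [10, 414, 50, 40],
    [8, 22, 578, 22],
    [7, 6, 27, 301],
]
W0, W = marginal_homogeneity_tests(table_11_5)
print(f"Stuart W0 = {W0:.2f}, Bhapkar W = {W:.2f}")
```

The output can be compared with the value of W quoted for these data in Section 11.3.2, and the pair (W0, W) should satisfy the Ireland et al. identity up to rounding error.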
Table 11.5  Region of Residence in 2010 and at Age 16, with Fit of Marginal
Homogeneity Model

                               Residence in 2010
Residence
at Age 16     Northeast    Midwest     South      West       Total
Northeast        266          15         61         28        370
                (266)       (12.6)     (35.7)     (16.3)    (330.6)
Midwest           10         414         50         40        514
                (12.3)      (414)      (32.8)     (26.2)    (485.4)
South              8          22        578         22        630
                (27.7)      (46.1)     (578)      (21.9)    (673.6)
West               7           6         27        301        341
                (24.6)      (12.7)     (27.1)     (301)     (365.4)
Total            291         457        716        391       1855
               (330.6)     (485.4)    (673.6)    (365.4)

Values in parentheses are fitted values for the marginal homogeneity model.
Source: 2010 General Social Survey.


11.3.2  Example:  Regional  Migration 

For the 2010 GSS of American adults, Table 11.5 compares the respondent's region of
residence with their region of residence at age 16. Relatively few people changed regions,
with 84% of the observations falling on the main diagonal. The ML fit of marginal homogeneity,
shown in Table 11.5, gives $G^2 = 93.64$ (df = 3). Statistics using differences in sample
marginal proportions give similar results. For instance, Bhapkar's statistic (11.15) is
W = 90.44 (df = 3).

The sample marginal percentages for the four regions were (19.9, 27.7, 34.0, 18.4)
at age 16 and (15.7, 24.6, 38.6, 21.1) in 2010. The large test statistics reflect the large
sample size. To estimate the change for a given region, we apply (11.2) to the collapsed
2 x 2 table that combines the other regions. A 95% confidence interval for $\pi_{+1} - \pi_{1+}$ is
(0.157 - 0.199) ± 1.96(0.006), or -0.043 ± 0.012. Similarly, a 95% confidence interval
for $\pi_{+2} - \pi_{2+}$ is -0.031 ± 0.013, for $\pi_{+3} - \pi_{3+}$ is 0.046 ± 0.014, and for $\pi_{+4} - \pi_{4+}$ is
0.027 ± 0.012.
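
These four intervals can be checked with a short script. The sketch assumes that (11.2) is the usual Wald interval for a difference of dependent proportions, with variance estimate $[(p_{12} + p_{21}) - (p_{21} - p_{12})^2]/n$ in the collapsed 2 x 2 table; the function name `margin_change_ci` is hypothetical.

```python
import math

# Table 11.5: rows = region at age 16, columns = region in 2010.
table = [
    [266, 15, 61, 28],
    [10, 414, 50, 40],
    [8, 22, 578, 22],
    [7, 6, 27, 301],
]
regions = ["Northeast", "Midwest", "South", "West"]

def margin_change_ci(table, a, z=1.96):
    """Estimate of pi_{+a} - pi_{a+} and its margin of error, collapsing to 2 x 2."""
    n = sum(map(sum, table))
    n_aa = table[a][a]
    p12 = (sum(table[a]) - n_aa) / n              # in region a at 16, elsewhere in 2010
    p21 = (sum(r[a] for r in table) - n_aa) / n   # elsewhere at 16, in region a in 2010
    diff = p21 - p12
    se = math.sqrt(((p12 + p21) - (p21 - p12) ** 2) / n)
    return diff, z * se

for a, name in enumerate(regions):
    diff, margin = margin_change_ci(table, a)
    print(f"{name}: {diff:.3f} +/- {margin:.3f}")
```

Under that assumption the printed estimates and margins agree with the intervals quoted above to three decimal places.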

11.3.3  Marginal  Models  for  Ordinal  Classifications 

For ordered categories, marginal model (11.6) for binary matched pairs extends using
ordinal logits. With cumulative logits,

$$\text{logit}[P(Y_t \le j)] = \alpha_j + \beta x_t, \quad t = 1, 2, \quad j = 1, \ldots, I - 1, \qquad (11.16)$$

where $x_1 = 0$ and $x_2 = 1$. This model has proportional odds structure (Section 8.2.2). The
odds of outcome $Y_2 \le j$ equal $\exp(\beta)$ times the odds of outcome $Y_1 \le j$. The model implies
stochastically ordered marginal distributions, with $\beta > 0$ meaning that $Y_1$ tends to be higher
than $Y_2$. Marginal homogeneity corresponds to $\beta = 0$.

Model fitting treats $(Y_1, Y_2)$ as dependent. The ML approach maximizes the multinomial
likelihood for $\{\pi_{ab}\}$. This is not simple. Since the model refers to marginal probabilities
Table 11.6  Opinions on Premarital Sex and Extramarital Sex

                        Extramarital Sex
Premarital Sex       1       2       3       4     Total
1                  324       6       1       0       331
2                   95      16       1       0       112
3                  185      30      17       1       233
4                  462     120      51      28       661
Total             1066     172      70      29      1337

Source: 2008 General Social Survey.


$\{P(Y_1 = a) = \pi_{a+}\}$ and $\{P(Y_2 = b) = \pi_{+b}\}$, we cannot substitute the model formula in
the kernel $\sum_a \sum_b n_{ab} \log \pi_{ab}$ of the log likelihood, which refers to joint probabilities. We
defer discussion of ML model fitting of marginal models to Sections 12.1.4 and 12.1.5.
Model (11.16) describes the 2(I - 1) marginal probabilities by I parameters, so df = I - 2
for testing fit.

The nominal-scale tests of marginal homogeneity presented in Section 11.3.1 use all
I - 1 degrees of freedom available for comparisons of I pairs of marginal probabilities.
With ordered categories, ordinal tests can focus on a single parameter, $H_0$: $\beta = 0$, with
df = 1. When I is large and the dependence between classifications is strong, the ordinal
tests can be much more powerful.


11.3.4  Example:  Opinions  on  Premarital  and  Extramarital  Sex 

Refer to Table 11.6. For the 2008 GSS, subjects gave their opinion about premarital sex
(a couple having sex before marriage) and extramarital sex (a married person having sex
with someone other than the marriage partner). The response categories are 1 = always
wrong, 2 = almost always wrong, 3 = wrong only sometimes, 4 = not wrong at all.

The sample cumulative marginal proportions are (0.25, 0.33, 0.51, 1.0) for premarital
sex and (0.80, 0.92, 0.98, 1.0) for extramarital sex. Responses on premarital sex tended to
be more tolerant than those on extramarital sex. The cumulative logit model (11.16) has
$\hat{\beta} = 2.780$ (SE = 0.079). There is extremely strong evidence that population responses are
more negative on extramarital than on premarital sex. The fit of the marginal homogeneity
model has $G^2 = 1092.34$ (df = 3), and the fit of the ordinal model (11.16) has
$G^2 = 105.14$ (df = 2). The ordinal model does not fit well, but it fits much better than the marginal
homogeneity model. Models to be considered in Section 11.4.7 fit much better yet.


11.4  SYMMETRY,  QUASI-SYMMETRY,  AND  QUASI-INDEPENDENCE 

An  alternative  analysis  of  square  contingency  tables  directly  models  the  joint  distribution 
using  logistic  or  loglinear  models.  Some  models  have  marginal  homogeneity  as  a  special 
case. 

An I x I joint distribution $\{\pi_{ab}\}$ satisfies symmetry if

$$\pi_{ab} = \pi_{ba} \quad \text{whenever } a \neq b. \qquad (11.17)$$



Under symmetry, $\pi_{a+} = \sum_b \pi_{ab} = \sum_b \pi_{ba} = \pi_{+a}$ for all a, so marginal homogeneity
occurs. For I = 2, symmetry is equivalent to marginal homogeneity, but for I > 2, marginal
homogeneity can occur without symmetry.

11.4.1  Symmetry  as  Logistic  and  Loglinear  Models 

When all $\pi_{ab} > 0$, symmetry is a logistic model and a loglinear model. In logistic form, it
is trivially

$$\log(\pi_{ab}/\pi_{ba}) = 0 \quad \text{for all } a < b.$$

For expected frequencies $\{\mu_{ab} = n\pi_{ab}\}$, it has the loglinear form

$$\log \mu_{ab} = \lambda + \lambda_a + \lambda_b + \lambda_{ab}, \qquad (11.18)$$

where all $\lambda_{ab} = \lambda_{ba}$. Both classifications have the same single-factor parameters $\{\lambda_a\}$, so
$\log \mu_{ab} = \log \mu_{ba}$.

For Poisson or multinomial cell counts $\{n_{ab}\}$, the likelihood equations are

$$\hat{\mu}_{ab} + \hat{\mu}_{ba} = n_{ab} + n_{ba} \ \text{for all } a < b \quad \text{and} \quad \hat{\mu}_{aa} = n_{aa} \ \text{for all } a.$$

The main diagonal has perfect fit. The solution that satisfies symmetry is

$$\hat{\mu}_{ab} = \frac{n_{ab} + n_{ba}}{2} \quad \text{for all } a, b.$$

The logistic symmetry model has residual df = I(I - 1)/2. For testing symmetry,
Bowker (1948) showed that $X^2$ simplifies to

$$X^2 = \sum\sum_{a<b} \frac{(n_{ab} - n_{ba})^2}{n_{ab} + n_{ba}}.$$

For I = 2 this is the chi-squared form of McNemar's statistic, the square of (11.4). The
standardized residuals equal

$$r_{ab} = (n_{ab} - n_{ba})/\sqrt{n_{ab} + n_{ba}}.$$

Only one residual for each pair of categories is nonredundant, since $r_{ab} = -r_{ba}$. They
satisfy $\sum\sum_{a<b} r_{ab}^2 = X^2$.

The  symmetry  model  is  very  simple.  Except  for  a  few  specialized  applications,  such  as 
describing  intraobserver  agreement  for  pairs  of  measurements  by  an  observer,  it  rarely  fits 
well.  When  the  marginal  distributions  differ  substantially,  it  necessarily  fits  poorly. 
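
As a small illustration of these formulas, the sketch below computes Bowker's $X^2$ and the likelihood-ratio statistic $G^2$ for the symmetry model, using the closed-form fitted values $(n_{ab} + n_{ba})/2$, for the migration counts of Table 11.5. The function name `symmetry_tests` is hypothetical, and the df returned assumes no off-diagonal pair is empty.

```python
import math

# Table 11.5 counts: rows = region at age 16, columns = region in 2010.
table = [
    [266, 15, 61, 28],
    [10, 414, 50, 40],
    [8, 22, 578, 22],
    [7, 6, 27, 301],
]

def symmetry_tests(table):
    """Return (X2, G2, df) for the symmetry model on a square table."""
    I = len(table)
    x2 = 0.0
    g2 = 0.0
    for a in range(I):
        for b in range(a + 1, I):
            n_ab, n_ba = table[a][b], table[b][a]
            total = n_ab + n_ba
            if total == 0:
                continue                      # an empty pair contributes nothing
            x2 += (n_ab - n_ba) ** 2 / total  # squared standardized residual r_ab^2
            fitted = total / 2                # symmetry fit for both cells of the pair
            for obs in (n_ab, n_ba):
                if obs > 0:
                    g2 += 2 * obs * math.log(obs / fitted)
    df = I * (I - 1) // 2
    return x2, g2, df

x2, g2, df = symmetry_tests(table)
print(f"X2 = {x2:.2f}, G2 = {g2:.2f}, df = {df}")
```

For these data the deviance matches the value $G^2 = 100.48$ (df = 6) reported for the symmetry model in Section 11.4.5.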

11.4.2  Quasi-symmetry 

To accommodate marginal heterogeneity, we can permit the main-effect terms in the symmetry
model (11.18) to differ. The resulting loglinear model, called quasi-symmetry, is

$$\log \mu_{ab} = \lambda + \lambda_a^X + \lambda_b^Y + \lambda_{ab}, \qquad (11.19)$$

where $\lambda_{ab} = \lambda_{ba}$ for all a < b (Caussinus 1966). Symmetry is the special case $\lambda_a^X = \lambda_a^Y$
for a = 1, ..., I, and independence is the special case in which all $\lambda_{ab} = 0$. Identifiability
requires further constraints, such as $\lambda_I^X = 0$ and all $\lambda_b^Y = 0$ (Exercise 11.36). The likelihood
equations for the quasi-symmetry model are

$$\hat{\mu}_{a+} = n_{a+}, \quad a = 1, \ldots, I,$$

$$\hat{\mu}_{+b} = n_{+b}, \quad b = 1, \ldots, I, \qquad (11.20)$$

$$\hat{\mu}_{ab} + \hat{\mu}_{ba} = n_{ab} + n_{ba} \quad \text{for } a \le b.$$

Only one of the first two sets of equations is needed. The other is redundant, given the
other two. The residual df = (I - 1)(I - 2)/2. From (11.20), $\hat{\mu}_{aa} = n_{aa}$ for a = 1, ..., I.
Otherwise, the likelihood equations do not have a direct solution and are solved using
iterative methods such as Newton-Raphson or IPF.

The meaning of quasi-symmetry is less obvious than symmetry. However, it usually fits
much better and has greater scope. The main-effect parameters determine the relative sizes
of $\mu_{ab}$ and $\mu_{ba}$. For example, with constraints that set all $\lambda_b^Y = 0$, $\log(\mu_{ab}/\mu_{ba}) = \lambda_a^X - \lambda_b^X$.
The quasi-symmetry model has multiplicative form

$$\pi_{ab} = \alpha_a \beta_b \gamma_{ab}, \quad \text{where } \gamma_{ab} = \gamma_{ba} \text{ for all } a < b, \qquad (11.21)$$

and all parameters are positive. The symmetry model is (11.21) with $\alpha_a = \beta_a$ for all a. This
equation indicates that a table satisfying quasi-symmetry is the cellwise product of a table
satisfying independence with one satisfying symmetry. The association symmetry implies
that odds ratios on one side of the main diagonal are identical to corresponding odds ratios
on the other side. In fact, the model can be defined by properties such as

$$\frac{\mu_{ab}\,\mu_{II}}{\mu_{aI}\,\mu_{Ib}} = \frac{\mu_{ba}\,\mu_{II}}{\mu_{bI}\,\mu_{Ia}} \quad \text{for all } a < b \qquad (11.22)$$

or $\theta_{ab} = \theta_{ba}$ for all local odds ratios. Goodman (1979a) referred to it as the symmetric
association model.

Another way to interpret the model parameters relates to subject-specific logistic models.
Consider the adaptation of baseline-category logit model (11.14) to a subject-specific
model,

$$\log[P(Y_{it} = j)/P(Y_{it} = I)] = \alpha_{ij} + \beta_j x_t, \quad t = 1, 2, \quad j = 1, \ldots, I - 1, \qquad (11.23)$$

where $x_1 = 1$ and $x_2 = 0$. This has the additive form of binary model (11.8) for each j.
The model implies, averaging over subjects, that the quasi-symmetry model (11.19) holds
for the I x I population-averaged table with $\{\beta_j = \lambda_j^X\}$, when we constrain $\lambda_I^X = 0$ and
all $\lambda_b^Y = 0$ (Section 12.2.7 shows this in the binary case). In fact, for the conditional ML
analysis that conditions out $\{\alpha_{ij}\}$, the conditional ML estimates of $\{\beta_j\}$ are identical to
the ML estimates of $\{\lambda_j^X\}$ for the quasi-symmetry model with these constraints (Conaway
1989).

11.4.3  Marginal Homogeneity and Quasi-symmetry

Marginal homogeneity is not equivalent to a loglinear model. However, quasi-symmetry is a
useful model for studying marginal homogeneity. Caussinus (1966) showed that symmetry
is equivalent to quasi-symmetry and marginal homogeneity holding simultaneously. We
have seen that symmetry implies both quasi-symmetry and marginal homogeneity. Now we
give Caussinus's argument for the converse, that the joint occurrence of quasi-symmetry
and marginal homogeneity implies symmetry.

From (11.21), if quasi-symmetry holds, $\pi_{ab} = \alpha_a \beta_b \gamma_{ab}$, where $\gamma_{ab} = \gamma_{ba} > 0$ for all
a < b. Equivalently,

$$\pi_{ab} = \rho_a \delta_{ab},$$

where $\rho_a = \alpha_a/\beta_a$ and $\delta_{ab} = \beta_a \beta_b \gamma_{ab}$ also satisfies $\delta_{ab} = \delta_{ba} > 0$ for all a < b. If there is
also marginal homogeneity, then

$$\pi_{j+} = \rho_j \sum_b \delta_{jb} = \sum_a \rho_a \delta_{aj} = \pi_{+j},$$

or

$$\rho_j = \Big(\sum_a \rho_a \delta_{aj}\Big) \Big/ \Big(\sum_b \delta_{jb}\Big) = \Big(\sum_a \rho_a \delta_{aj}\Big) \Big/ \Big(\sum_b \delta_{bj}\Big), \quad j = 1, \ldots, I.$$

Thus, each $\rho_j$ is a weighted average of $\{\rho_a\}$, with weights $\{\delta_{aj}/\sum_b \delta_{bj} > 0,\ a = 1, \ldots, I\}$.
Any set $\{\rho_a\}$ satisfying this must be identical. Otherwise, there would be a $\rho_j$ that is no
greater than any $\rho_a$ but smaller than at least one, and hence it could not be a positive
weighted average of all of them. But since $\{\rho_a\}$ are identical, $\pi_{ab} = \rho_a \delta_{ab} = \rho_b \delta_{ab} =
\rho_b \delta_{ba} = \pi_{ba}$, so symmetry holds. Thus, a table that satisfies both quasi-symmetry and
marginal homogeneity also satisfies symmetry. Since the converse holds,

quasi-symmetry + marginal homogeneity = symmetry.  (11.24)

It follows that when quasi-symmetry (QS) holds, marginal homogeneity (MH) is equivalent
to symmetry (S), which is $\{\lambda_a^X = \lambda_a^Y,\ a = 1, \ldots, I\}$ in the QS model. Thus, conditional
on quasi-symmetry, testing marginal homogeneity is equivalent to testing symmetry. A test
of marginal homogeneity compares fit statistics for the symmetry and quasi-symmetry
models,

$$G^2(S \mid QS) = G^2(S) - G^2(QS), \qquad (11.25)$$

with df = I - 1. This is an alternative to approaches discussed in Section 11.3.1 using
baseline-category logit marginal models.

11.4.4  Quasi-independence 

Square tables usually exhibit positive dependence, manifested by larger counts on the main
diagonal than the independence model predicts. Conditional on the event that a matched
pair falls off the main diagonal, though, the relationship may have a simple structure.
A square contingency table satisfies quasi-independence when the variables are independent,
given that the row and column outcomes differ. This has the loglinear form

$$\log \mu_{ab} = \lambda + \lambda_a^X + \lambda_b^Y + \delta_a I(a = b), \qquad (11.26)$$

where I(.) is the indicator function,

$$I(a = b) = \begin{cases} 1, & a = b \\ 0, & a \neq b. \end{cases}$$

This adds a parameter to the independence model for each cell on the main diagonal. The
first three terms on the right-hand side of (11.26) specify independence, and $\{\delta_a\}$ permit
$\{\mu_{aa}\}$ to depart from this pattern and have arbitrary positive values. When $\delta_a > 0$, $\mu_{aa}$ is
larger than under independence.

The likelihood equations for quasi-independence are

$$\hat{\mu}_{a+} = n_{a+}, \quad \hat{\mu}_{+a} = n_{+a}, \quad \hat{\mu}_{aa} = n_{aa}, \quad a = 1, \ldots, I.$$

A perfect fit occurs on the main diagonal, but independence holds for the remaining cells.
The model implies that odds ratios equal 1.0 for all rectangularly formed 2 x 2 tables in
which all cells fall off the main diagonal. The model can be fitted using Newton-Raphson
or IPF. The model has I more parameters than the independence model, so its residual
df = $(I - 1)^2 - I$. It applies to tables with $I \ge 3$.

Quasi-independence is the special case of quasi-symmetry (11.21) in which $\{\gamma_{ab}$ for $a \neq b\}$
are identical. They are equivalent when I = 3 (Caussinus 1966, p. 146).


11.4.5  Example:  Migration  Revisited 

For Table 11.5 on migration patterns, not surprisingly the independence model fits terribly
($G^2 = 2828.22$, df = 9). The symmetry model is also unpromising. For instance, 40 people
moved from the Midwest to the West, but only 6 people made the reverse move. The deviance
for testing symmetry is $G^2 = 100.48$ (df = 6). The much smaller deviance compared with
the independence model reflects that it forces a perfect fit on the main diagonal, where most
observations occur.

The quasi-symmetry model has $G^2 = 6.98$, with df = 3. Table 11.7 displays its fit. The
difference $G^2(S \mid QS) = 100.48 - 6.98 = 93.50$ (df = 3) shows extremely strong evidence
of marginal heterogeneity. Results are similar to those quoted in Section 11.3.2 for the
likelihood-ratio test based on baseline-category logit model (11.14), for which $G^2 = 93.64$
(df = 3).

The lack of symmetry in cell probabilities reflects the marginal heterogeneity. The effects
can be described using the quasi-symmetry model parameter estimates. With the constraints
$\lambda_4^X = 0$ and all $\lambda_b^Y = 0$, we have ($\hat{\lambda}_1^X = 1.74$, $\hat{\lambda}_2^X = 1.21$, $\hat{\lambda}_3^X = 0.02$). For example, since
$\log(\hat{\mu}_{14}/\hat{\mu}_{41}) = \hat{\lambda}_1^X$, the estimated odds of moving from the Northeast to the West were
exp(1.74) = 5.7 times the odds of moving from the West to the Northeast.

Quasi-independence states that for people who moved, their region in 2010 is independent
of their region at age 16. Table 11.7 also contains its fitted values, for which
$G^2 = 9.21$ (df = 5). Its fit is not significantly poorer than the quasi-symmetry model.
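
The quasi-independence fit can be sketched with iterative proportional fitting (IPF) restricted to the off-diagonal cells, since the diagonal is fitted perfectly. The code below is a minimal illustration under that approach, not the procedure used to produce Table 11.7; the function name `fit_quasi_independence` and the fixed iteration count are hypothetical choices.

```python
import math

# Table 11.5 counts: rows = region at age 16, columns = region in 2010.
table = [
    [266, 15, 61, 28],
    [10, 414, 50, 40],
    [8, 22, 578, 22],
    [7, 6, 27, 301],
]

def fit_quasi_independence(table, iters=200):
    """IPF on off-diagonal cells; diagonal fitted values equal the observed counts."""
    I = len(table)
    fit = [[float(table[a][b]) if a == b else 1.0 for b in range(I)] for a in range(I)]
    for _ in range(iters):
        for a in range(I):   # match off-diagonal row totals
            target = sum(table[a][b] for b in range(I) if b != a)
            current = sum(fit[a][b] for b in range(I) if b != a)
            for b in range(I):
                if b != a:
                    fit[a][b] *= target / current
        for b in range(I):   # match off-diagonal column totals
            target = sum(table[a][b] for a in range(I) if a != b)
            current = sum(fit[a][b] for a in range(I) if a != b)
            for a in range(I):
                if a != b:
                    fit[a][b] *= target / current
    g2 = 2 * sum(table[a][b] * math.log(table[a][b] / fit[a][b])
                 for a in range(I) for b in range(I)
                 if a != b and table[a][b] > 0)
    return fit, g2

fit, g2 = fit_quasi_independence(table)
df = (len(table) - 1) ** 2 - len(table)
print(f"G2 = {g2:.2f}, df = {df}")
print([round(v, 1) for v in fit[0]])   # fitted first row (Northeast at age 16)
```

The fitted off-diagonal values and the deviance can be compared with the quasi-independence entries of Table 11.7 and the reported $G^2$.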
Table 11.7  Fits of Models to Migration Table 11.5

                                        Residence in 2010
Residence at Age 16    Northeast        Midwest          South            West          Total
Northeast                 266              15               61              28            370
                       (266, 266)^a    (18.3, 15.7)    (54.8, 58.5)    (30.9, 29.8)
Midwest                    10             414               50              40            514
                       (10.9, 9.3)     (414, 414)      (57.0, 55.2)    (32.1, 35.5)
South                       8              22              578              22            630
                       (9.1, 10.5)     (15.9, 16.8)    (578, 578)      (26.9, 24.8)
West                        7               6               27             301            341
                       (5.0, 5.2)      (8.8, 10.5)     (26.2, 24.2)    (301, 301)
Total                     291             457              716             391           1855

^a First value is quasi-independence fit, second is quasi-symmetry fit; both models give a perfect fit on main
diagonal.


11.4.6  Ordinal  Quasi-symmetry 

The loglinear models presented so far for square tables treat classifications as nominal.
With ordered categories, more parsimonious models are useful. Let $u_1 \le \cdots \le u_I$ denote
ordered scores for both the rows and columns. An ordinal quasi-symmetry model is

$$\log \mu_{ab} = \lambda + \lambda_a + \lambda_b + \beta u_b + \lambda_{ab}, \qquad (11.27)$$

where $\lambda_{ab} = \lambda_{ba}$ for all a < b. It is the special case of the quasi-symmetry model (11.19)
in which

$$\lambda_b^Y - \lambda_b^X = \beta u_b$$

has a linear trend. Symmetry is the special case $\beta = 0$.

This model has logistic representation,

$$\log(\pi_{ab}/\pi_{ba}) = \beta(u_b - u_a) \quad \text{for } a < b. \qquad (11.28)$$

This models the logit of the conditional probability of cell (a, b), given response sequence
(a, b) or (b, a). The greater the value of $|\beta|$, the greater the difference between $\pi_{ab}$ and $\pi_{ba}$
and hence between the marginal distributions. Its likelihood equations are

$$\sum_a u_a \hat{\mu}_{a+} = \sum_a u_a n_{a+}, \qquad \sum_b u_b \hat{\mu}_{+b} = \sum_b u_b n_{+b},$$

$$\hat{\mu}_{ab} + \hat{\mu}_{ba} = n_{ab} + n_{ba} \quad \text{for } a \le b.$$

The fitted marginal counts need not equal the observed marginal counts. However, dividing
the first two equations by n shows that they have the same means, for the chosen scores.

When $\beta \neq 0$, this model implies stochastically ordered margins. When $\beta > 0$, responses
have a higher mean in the column distribution. Like the ordinal marginal models
(Section 11.3.3), this model concentrates the marginal effect on df = 1. A test of marginal
homogeneity ($H_0$: $\beta = 0$) uses

ordinal quasi-symmetry + marginal homogeneity = symmetry.

The likelihood-ratio test statistic compares the deviance for symmetry and ordinal
quasi-symmetry.

We can fit model (11.28) with logistic model software: Identify $(n_{ab}, n_{ba})$ as binomial
with $n_{ab} + n_{ba}$ trials, and fit a logistic model with no intercept and predictor $x = u_b - u_a$.

11.4.7  Example:  Premarital  and  Extramarital  Sex  Revisited 

For Table 11.6 on attitudes toward premarital and extramarital sex, a cursory glance at the
data reveals that the symmetry model is inadequate ($G^2 = 1243.07$, df = 6). By comparison,
quasi-symmetry fits well ($G^2 = 0.65$, df = 3).

The simpler model of ordinal quasi-symmetry also fits well: With scores (1, 2, 3, 4),
$G^2 = 2.81$ (df = 5). The ML estimate $\hat{\beta} = -3.035$. From (11.28), the estimated probability
that outcome on premarital sex is x categories more positive than the outcome on
extramarital sex equals exp(3.035x) times the reverse probability.
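
The logistic-model route to fitting (11.28), described in Section 11.4.6, can be sketched for Table 11.6 as follows. This is a bare Newton-Raphson illustration in plain Python rather than a call to logistic regression software; the function name `fit_ordinal_quasi_symmetry` is hypothetical, and the scores (1, 2, 3, 4) are those used in the text.

```python
import math

# Table 11.6 counts: rows = premarital sex, columns = extramarital sex.
table = [
    [324, 6, 1, 0],
    [95, 16, 1, 0],
    [185, 30, 17, 1],
    [462, 120, 51, 28],
]
scores = [1, 2, 3, 4]

def fit_ordinal_quasi_symmetry(table, scores, iters=50):
    """Fit log(pi_ab / pi_ba) = beta * (u_b - u_a), a < b, as a no-intercept logit."""
    I = len(table)
    # For each pair a < b: predictor x, "successes" n_ab, binomial trials n_ab + n_ba.
    pairs = [(scores[b] - scores[a], table[a][b], table[a][b] + table[b][a])
             for a in range(I) for b in range(a + 1, I)]
    beta = 0.0
    for _ in range(iters):
        score = 0.0
        info = 0.0
        for x, successes, trials in pairs:
            p = 1.0 / (1.0 + math.exp(-beta * x))
            score += x * (successes - trials * p)
            info += x * x * trials * p * (1.0 - p)
        beta += score / info
    return beta

beta_hat = fit_ordinal_quasi_symmetry(table, scores)
print(f"beta_hat = {beta_hat:.3f}, exp(-beta_hat) = {math.exp(-beta_hat):.1f}")
```

The estimate is negative because nearly all discordant pairs fall below the main diagonal, where the premarital response category is higher than the extramarital one.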


11.5  MEASURING  AGREEMENT  BETWEEN  OBSERVERS 

We now discuss an application, analyzing agreement between two observers, that uses
models for matched pairs. We illustrate with Table 11.8. This shows ratings by two
pathologists, labeled A and B, who separately classified 118 slides regarding the presence and
extent of carcinoma of the uterine cervix. The rating scale has the ordered categories (1)
negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, and (4) squamous or
invasive carcinoma.


Table 11.8  Diagnoses of Carcinoma

                             Pathologist B^a
Pathologist A        1         2         3         4       Total
1                   22         2         2         0         26
                   (8.5)    (-0.5)    (-5.9)    (-1.8)
2                    5         7        14         0         26
                  (-0.5)     (3.2)    (-0.5)    (-1.8)
3                    0         2        36         0         38
                  (-4.1)    (-1.2)     (5.5)    (-2.3)
4                    0         1        17        10         28
                  (-3.3)    (-1.3)     (0.3)     (5.9)
Total               27        12        69        10        118

^a Values in parentheses are standardized residuals for the independence model.
Source: N. Holmquist et al., Arch. Pathol. 84: 334-345, 1967. Reprinted with permission
from the American Medical Association. See also Landis and Koch (1977).




11.5.1  Agreement:  Departures  from  Independence 

For Table 11.8, let $\pi_{ab}$ denote the probability that observer A classifies a slide in category $a$
and observer B classifies it in category $b$. Then $\pi_{aa}$ is the probability that they both choose
category $a$, and $\sum_a \pi_{aa}$ is the total probability of agreement. Perfect agreement occurs
when $\sum_a \pi_{aa} = 1$.

With  subjective  scales,  agreement  is  less  than  perfect.  Analyses  focus  on  describing 
strength  of  agreement  and  detecting  patterns  of  disagreement.  Agreement  and  association 
are  distinct  facets  of  the  joint  distribution.  Strong  agreement  requires  strong  association, 
but  strong  association  can  exist  without  strong  agreement.  If  observer  A  consistently  rates 
subjects  one  category  higher  than  observer  B,  strength  of  agreement  is  poor  even  though 
the  association  is  strong. 

Evaluations of agreement can compare $\{n_{ab}\}$ to the values $\{n_{a+}n_{+b}/n\}$ predicted under
independence.  That  model  is  a  baseline,  showing  the  agreement  expected  if  no  association 
existed  between  ratings.  Normally,  it  fits  poorly  if  even  mild  agreement  exists,  but  its 
cell  standardized  residuals  (Section  3.3.1)  show  patterns  of  agreement  and  disagreement. 
Ideally,  standardized  residuals  are  large  positive  on  the  main  diagonal  and  large  negative 
off that diagonal. The sizes are influenced by sample size $n$, with larger values tending to
occur  with  larger  n. 

The independence model fits Table 11.8 poorly ($G^2 = 117.96$, df = 9). That table also
reports  the  standardized  residuals.  The  large  positive  residuals  on  the  main  diagonal  indicate 
that  agreement  for  each  category  is  greater  than  expected  by  chance,  especially  for  the  first 
category.  Off  the  main  diagonal  they  are  primarily  negative.  Disagreements  occurred  less 
than  expected  under  independence,  although  the  evidence  of  this  is  weaker  for  categories 
closer  together.  The  most  common  disagreements  were  observer  B  choosing  category  3  and 
observer  A  instead  choosing  category  2  or  4. 
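The residual analysis above can be reproduced directly from the counts in Table 11.8. The sketch below is only an illustration: it assumes the usual form of the standardized residual for independence, $(n_{ab} - \hat\mu_{ab})/\sqrt{\hat\mu_{ab}(1 - p_{a+})(1 - p_{+b})}$, and treats cells with zero count as contributing zero to $G^2$. The variable names are ours.

```python
import math

# Table 11.8 counts: rows = pathologist A, columns = pathologist B.
counts = [
    [22, 2, 2, 0],
    [5, 7, 14, 0],
    [0, 2, 36, 0],
    [0, 1, 17, 10],
]
k = len(counts)
n = sum(map(sum, counts))
row = [sum(r) for r in counts]
col = [sum(c) for c in zip(*counts)]

# Fitted values under independence: n_{a+} n_{+b} / n.
fitted = [[row[a] * col[b] / n for b in range(k)] for a in range(k)]

# Standardized residuals for the independence model.
resid = [
    [
        (counts[a][b] - fitted[a][b])
        / math.sqrt(fitted[a][b] * (1 - row[a] / n) * (1 - col[b] / n))
        for b in range(k)
    ]
    for a in range(k)
]

# Likelihood-ratio statistic; empty cells contribute zero.
g2 = 2 * sum(
    counts[a][b] * math.log(counts[a][b] / fitted[a][b])
    for a in range(k)
    for b in range(k)
    if counts[a][b] > 0
)

for r in resid:
    print("  ".join(f"{v:5.1f}" for v in r))
print(f"G2 = {g2:.2f}, df = {(k - 1) ** 2}")
```

The printed residuals should match the parenthesized values in Table 11.8 to one decimal place.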

11.5.2  Using  Quasi-independence  to  Analyze  Agreement 

More  complex  models  add  components  that  relate  to  agreement  beyond  that  expected  under 
independence. A useful generalization is quasi-independence (11.26), which adds
main-diagonal parameters $\{\delta_a\}$. For Table 11.8, this model has $G^2 = 13.18$ (df = 5). It fits much
better than independence, but some lack of fit remains. Table 11.9 shows the fit.


Table 11.9  Fitted Values for Carcinoma Diagnoses of Table 11.8

                                   Pathologist B
Pathologist A         1              2               3               4
      1              22              2               2               0
                 (22, 22)^a     (0.7, 2.4)      (3.3, 1.6)      (0.0, 0.0)
      2               5              7              14               0
                 (2.4, 4.6)       (7, 7)       (16.6, 14.4)     (0.0, 0.0)
      3               0              2              36               0
                 (0.8, 0.4)     (1.2, 1.6)       (36, 36)       (0.0, 0.0)
      4               0              1              17              10
                 (1.9, 0.0)     (3.0, 1.0)     (13.1, 17.0)      (10, 10)

^a Quasi-independence model fit followed by quasi-symmetry model fit.
Loglinear models can directly address the association component of agreement. For two
observations, suppose each observer classifies one in category $a$ and one in category $b$. The
odds that the observers agree rather than disagree on which is in category $a$ and which is in
category $b$ equal

$$\tau_{ab} = \frac{\pi_{aa}\pi_{bb}}{\pi_{ab}\pi_{ba}} = \frac{\mu_{aa}\mu_{bb}}{\mu_{ab}\mu_{ba}}. \qquad (11.29)$$

Under the quasi-independence loglinear model,

$$\tau_{ab} = \exp(\delta_a + \delta_b).$$


Larger $\{\delta_a\}$ represent stronger agreement and stronger association. For instance, for
Table 11.8, $\hat\delta_2 = 0.60$ and $\hat\delta_3 = 1.90$, and $\hat\tau_{23} = 12.3$. The degree of agreement/association
also seems quite strong for other pairs of categories.

11.5.3  Quasi-symmetry  and  Agreement  Modeling 

For  Table  11.8,  the  quasi-independence  model  shows  some  lack  of  fit.  Given  that  the 
pathologists  disagree,  some  association  remains  between  ratings.  For  observer  agreement 
tables,  this  is  common.  Quasi-symmetry  (11.19)  often  fits  much  better,  because  it  permits 
association. For Table 11.8, it has $G^2 = 0.98$ (df = 2). Table 11.9 displays the fit. It is
not unusual for tables to have many empty cells. When $n_{ab} + n_{ba} = 0$ for any pair (such
as categories 1 and 4 in Table 11.8), the ML fitted values for quasi-symmetry in those
cells must also be zero since one of its likelihood equations is $\hat\mu_{ab} + \hat\mu_{ba} = n_{ab} + n_{ba}$. You
should  eliminate  those  cells  from  the  fitting  process  to  get  the  proper  residual  df  value. 

Under quasi-symmetry, $\tau_{ab} = \exp(\lambda_{aa} + \lambda_{bb} - \lambda_{ab} - \lambda_{ba})$, where $\lambda_{ab} = \lambda_{ba}$. For
categories 2 and 3 of Table 11.8, for instance, $\hat\tau_{23} = 10.7$. The model also yields
information about similarity of marginal distributions. The simpler symmetry model that forces
the margins to be identical fits Table 11.8 poorly ($G^2 = 39.18$, df = 5). The statistic
$G^2(S \mid QS) = 39.18 - 0.98 = 38.20$ (df = 3) provides strong evidence of marginal
heterogeneity. In Table 11.8, differences in marginal proportions are substantial in each category
but the first. The marginal heterogeneity is one reason that the agreement is not stronger.

Models  for  agreement  can  take  ordering  of  categories  into  account  (Agresti  2010,  Sec. 
8.5.3).  Conditional  on  observer  disagreement,  a  tendency  usually  remains  for  high  (low) 
ratings  by  one  observer  to  occur  with  relatively  high  (low)  ratings  by  the  other  observer. 

11.5.4  Kappa:  A  Summary  Measure  of  Agreement 

An alternative approach summarizes agreement with a single index. For nominal scales,
the most popular measure is Cohen's kappa (Cohen 1960). It compares the probability of
agreement $\sum_a \pi_{aa}$ to that expected if the ratings were independent, $\sum_a \pi_{a+}\pi_{+a}$, by

$$\kappa = \frac{\sum_a \pi_{aa} - \sum_a \pi_{a+}\pi_{+a}}{1 - \sum_a \pi_{a+}\pi_{+a}}.$$

The denominator equals the numerator with $\sum_a \pi_{aa}$ replaced by its maximum possible
value of 1, corresponding to perfect agreement. Kappa equals 0 when the agreement merely
equals that expected under independence. It equals 1.0 when perfect agreement occurs. The
stronger the agreement, the higher is $\kappa$, for given marginal distributions. Negative values
occur when agreement is weaker than expected by chance, but this rarely happens.

For multinomial sampling, the sample value $\hat\kappa$ has a large-sample normal distribution.
Its estimated asymptotic variance is

$$\hat\sigma^2(\hat\kappa) = \frac{1}{n}\left\{ \frac{P_o(1 - P_o)}{(1 - P_e)^2} + \frac{2(1 - P_o)\left[2P_oP_e - \sum_a p_{aa}(p_{a+} + p_{+a})\right]}{(1 - P_e)^3} + \frac{(1 - P_o)^2\left[\sum_a \sum_b p_{ab}(p_{b+} + p_{+a})^2 - 4P_e^2\right]}{(1 - P_e)^4} \right\},$$

where $P_o = \sum_a p_{aa}$ and $P_e = \sum_a p_{a+}p_{+a}$ (Fleiss et al. 1969). It is rarely plausible that
agreement is no better than expected by chance. Thus, rather than testing $H_0\colon \kappa = 0$, it is
more relevant to estimate strength of agreement by interval estimation of $\kappa$.

For Table 11.8, $P_o = 0.636$ and $P_e = 0.281$. Sample kappa equals (0.636 - 0.281)/
(1 - 0.281) = 0.493. The difference between observed agreement and that expected under
independence is about 50% of the maximum possible difference. The estimated standard
error is 0.057, so an approximate 95% confidence interval for $\kappa$ is 0.493 ± 1.96(0.057), or about (0.38, 0.60), indicating moderately strong agreement.
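As a check on these numbers, the following sketch computes sample kappa and its estimated standard error for Table 11.8 from the formulas above. It is an illustration under the stated variance formula, with our own variable names, rather than output from any particular package.

```python
import math

# Table 11.8 counts: rows = pathologist A, columns = pathologist B.
counts = [
    [22, 2, 2, 0],
    [5, 7, 14, 0],
    [0, 2, 36, 0],
    [0, 1, 17, 10],
]
k = len(counts)
n = sum(map(sum, counts))
p = [[c / n for c in r] for r in counts]
prow = [sum(p[a]) for a in range(k)]                        # p_{a+}
pcol = [sum(p[a][b] for a in range(k)) for b in range(k)]   # p_{+b}

p_o = sum(p[a][a] for a in range(k))              # observed agreement
p_e = sum(prow[a] * pcol[a] for a in range(k))    # agreement expected under independence
kappa = (p_o - p_e) / (1 - p_e)

# Estimated asymptotic variance (Fleiss et al. 1969), term by term.
term1 = p_o * (1 - p_o) / (1 - p_e) ** 2
term2 = (
    2 * (1 - p_o)
    * (2 * p_o * p_e - sum(p[a][a] * (prow[a] + pcol[a]) for a in range(k)))
    / (1 - p_e) ** 3
)
term3 = (
    (1 - p_o) ** 2
    * (
        sum(p[a][b] * (prow[b] + pcol[a]) ** 2 for a in range(k) for b in range(k))
        - 4 * p_e ** 2
    )
    / (1 - p_e) ** 4
)
se = math.sqrt((term1 + term2 + term3) / n)

lower, upper = kappa - 1.96 * se, kappa + 1.96 * se
print(f"P_o = {p_o:.3f}, P_e = {p_e:.3f}, kappa = {kappa:.3f}, SE = {se:.3f}")
print(f"approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```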

11.5.5  Weighted  Kappa:  Quantifying  Disagreement 

Kappa treats classifications as nominal. When categories are ordered, the seriousness of a
disagreement depends on the difference between the ratings. For nominal classifications
also, some disagreements may be considered more severe than others. The measure weighted
kappa (Spitzer et al. 1967) uses weights $\{w_{ab}\}$ satisfying $0 \le w_{ab} \le 1$, with all $w_{aa} = 1$
and all $w_{ab} = w_{ba}$, to describe closeness of agreement. One possibility is
$\{w_{ab} = 1 - |a - b|/(I - 1)\}$, for which agreement is greater for cells nearer the main diagonal. Fleiss
and Cohen (1973) suggested $\{w_{ab} = 1 - (a - b)^2/(I - 1)^2\}$. The weighted agreement is
$\sum_a \sum_b w_{ab}\pi_{ab}$ and weighted kappa is

$$\kappa_w = \frac{\sum_a \sum_b w_{ab}\pi_{ab} - \sum_a \sum_b w_{ab}\pi_{a+}\pi_{+b}}{1 - \sum_a \sum_b w_{ab}\pi_{a+}\pi_{+b}}.$$
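A short sketch may help show how the choice of weights enters. The function below computes sample weighted kappa for Table 11.8 with the two weighting schemes just mentioned; the function and weight names are hypothetical, and the identity weights are included only to confirm that the measure reduces to ordinary kappa.

```python
# Table 11.8 counts: rows = pathologist A, columns = pathologist B.
counts = [
    [22, 2, 2, 0],
    [5, 7, 14, 0],
    [0, 2, 36, 0],
    [0, 1, 17, 10],
]


def weighted_kappa(counts, weight):
    """Sample weighted kappa; weight(a, b, k) returns w_ab for a k-category scale."""
    k = len(counts)
    n = sum(map(sum, counts))
    prow = [sum(r) / n for r in counts]
    pcol = [sum(c) / n for c in zip(*counts)]
    observed = sum(
        weight(a, b, k) * counts[a][b] / n for a in range(k) for b in range(k)
    )
    expected = sum(
        weight(a, b, k) * prow[a] * pcol[b] for a in range(k) for b in range(k)
    )
    return (observed - expected) / (1 - expected)


def identity(a, b, k):
    return 1.0 if a == b else 0.0          # ordinary (unweighted) kappa


def linear(a, b, k):
    return 1 - abs(a - b) / (k - 1)        # w_ab = 1 - |a - b| / (I - 1)


def quadratic(a, b, k):
    return 1 - (a - b) ** 2 / (k - 1) ** 2  # Fleiss and Cohen (1973) weights


for name, w in [("identity", identity), ("linear", linear), ("quadratic", quadratic)]:
    print(f"{name:9s} weights: kappa_w = {weighted_kappa(counts, w):.3f}")
```

Because the weights give partial credit to near-diagonal cells, the weighted values for an ordinal scale typically exceed ordinary kappa when most disagreements are between adjacent categories.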

Controversy  surrounds  the  utility  of  kappa  and  weighted  kappa,  partly  because  their 
values  depend  strongly  on  the  marginal  distributions.  The  same  diagnostic  rating  process 
can  yield  quite  different  values,  depending  on  the  proportions  of  cases  of  the  various  types 
(Exercise  11.42).  In  summarizing  a  contingency  table  by  a  single  number,  the  reduction 
in  information  can  be  severe.  An  alternative  is  to  find  kappa  separately  for  each  outcome 
category, for the 2×2 table in which the other categories are combined. Or, models can
provide more detailed description of the agreement and disagreement structure, as we noted
in Sections 11.5.2 and 11.5.3.

11.5.6  Extensions  to  Multiple  Observers 

With  several  observers,  ordinary  loglinear  models  are  not  usually  relevant.  Their  description 
of  agreement  and  association  between  two  observers  is  conditional  on  ratings  by  the  others. 
It is more relevant to study this marginally, without conditioning on the other ratings.
Models  for  the  pairwise  agreement  and  association  structure  then  focus  simultaneously  on 
all  pairs  of  two-way  marginal  distributions  (Becker  and  Agresti  1992). 

Other  approaches  have  also  been  used.  For  instance,  generalizations  of  kappa  summarize 
pairwise agreements or multiple agreements (Fleiss et al. 2003, Sec. 18.3; Landis and Koch
1977).  A  mixture  model  assumes  latent  classes  of  subjects  for  whom  the  observers  agree 
and  subjects  for  whom  they  disagree.  Such  an  analysis  is  shown  in  Section  14.1.3. 


11.6  BRADLEY-TERRY  MODEL  FOR  PAIRED  PREFERENCES 

Sometimes,  categorical  outcomes  result  from  pairwise  evaluations.  A  common  example  is 
athletic  competitions,  when  the  outcome  for  a  team  or  player  consists  of  categories  (win, 
lose).  Another  example  is  pairwise  comparison  of  product  brands,  such  as  two  brands  of 
wine of some type. When a wine critic rates $I$ brands of New Zealand sauvignon blanc,
it might be difficult to establish an outright ranking, especially if $I$ is large. However, for
any  given  pair,  the  critic  could  probably  state  a  preference  after  tasting  them  at  the  same 
occasion.  An  overall  ranking  of  the  wines  could  then  be  based  on  the  pairwise  preferences. 
We  next  present  a  model  for  doing  this. 

11.6.1  Bradley-Terry  Model 

Bradley and Terry (1952) proposed a logistic model for paired evaluations. Let $\Pi_{ab}$ denote
the probability that $a$ is preferred to $b$. Suppose that $\Pi_{ab} + \Pi_{ba} = 1$ for all pairs; that is, a
tie cannot occur. The Bradley-Terry model is

$$\log\frac{\Pi_{ab}}{\Pi_{ba}} = \beta_a - \beta_b. \qquad (11.30)$$

Alternatively,

$$\Pi_{ab} = \exp(\beta_a)/[\exp(\beta_a) + \exp(\beta_b)].$$

Thus, $\Pi_{ab} = \tfrac{1}{2}$ when $\beta_a = \beta_b$ and $\Pi_{ab} > \tfrac{1}{2}$ when $\beta_a > \beta_b$. Identifiability requires a
constraint such as $\beta_I = 0$. Since the model describes all the pairwise probabilities $\{\Pi_{ab}$ for
$a < b\}$ by $(I - 1)$ parameters, residual df = $\binom{I}{2} - (I - 1)$.

For $a < b$, let $N_{ab}$ denote the sample number of evaluations, with $a$ preferred $n_{ab}$ times
and $b$ preferred $n_{ba} = N_{ab} - n_{ab}$ times. A square contingency table with empty cells on
the main diagonal summarizes results. When the $N_{ab}$ comparisons are independent with
probability $\Pi_{ab}$ for each, $n_{ab}$ has a bin($N_{ab}$, $\Pi_{ab}$) distribution. If evaluations for different
pairs are also independent, ordinary methods for logistic models apply for fitting the model.

11.6.2  Example:  Major  League  Baseball  Rankings 

Table 11.10 shows results of the 2011 season for the five baseball teams in the Eastern
Division of the American League. For instance, of games between Boston and New York,
Boston won 12 and New York won 6 (one of the few bright points in a disastrous season for




Table 11.10  Results of 2011 Season for American League (Eastern Division) Baseball Teams

                                      Losing Team
Winning Team    Boston    New York    Tampa Bay    Toronto    Baltimore
Boston             -         12           6           10          10
New York           6          -           9           11          13
Tampa Bay         12          9           -           12           9
Toronto            8          7           6            -          12
Baltimore          8          5           9            6           -

Source: www.baseball-reference.com/leagues/AL/2011-standings.shtml.

Table 11.11  Results of Fitting Bradley-Terry Model to Baseball Data of Table 11.10

Team          Winning Percentage    $\hat\beta_a$      SE
Boston               52.8              0.454       0.304
New York             54.2              0.499       0.305
Tampa Bay            58.3              0.635       0.307
Toronto              45.8              0.229       0.303
Baltimore            38.9              0.000         -

the Red Sox). Table 11.10 shows the population of regular-season games. We regard this
as a sample estimate of a conceptual distribution representing the long-run performance of
teams as constituted in 2011.

We fitted the Bradley-Terry model as a logistic model for $\binom{5}{2} = 10$ independent
binomial samples, using an appropriate model matrix and no intercept. The assumption of
binomial sampling is suspect, as for any particular pair of teams the probability of victory
for a team may vary according to factors not considered here, such as the quality of the
starting pitcher for each team. Nonetheless, the model fits adequately ($G^2 = 7.70$, df = 6).
Table 11.11 displays the sample proportion of games each team won and the model
estimates of $\{\beta_a\}$, setting $\hat\beta_5 = 0$. When Boston played New York, the estimated probability
that Boston won is

$$\hat\Pi_{12} = \exp(\hat\beta_1)/[\exp(\hat\beta_1) + \exp(\hat\beta_2)] = 0.489.$$

The standard error of each $\hat\beta_a - \hat\beta_b$ is about 0.30, so not much evidence exists of a difference
among these teams. The likelihood-ratio statistic for testing $H_0\colon \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5$
is 5.48 with df = 4 (P = 0.24).
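The estimates in Table 11.11 can be reproduced without logistic regression software. The sketch below swaps in a different but equivalent route to the same ML estimates, the classical iterative scaling (MM) update for Bradley-Terry strengths $s_a = \exp(\beta_a)$, applied to the counts of Table 11.10. The variable names and the fixed number of iterations are assumptions made for the example.

```python
import math

teams = ["Boston", "New York", "Tampa Bay", "Toronto", "Baltimore"]
# wins[a][b] = number of games team a won against team b (Table 11.10).
wins = [
    [0, 12, 6, 10, 10],
    [6, 0, 9, 11, 13],
    [12, 9, 0, 12, 9],
    [8, 7, 6, 0, 12],
    [8, 5, 9, 6, 0],
]
k = len(teams)

# MM update: s_a <- (total wins of a) / sum_b N_ab / (s_a + s_b).
strength = [1.0] * k
for _ in range(2000):
    new = []
    for a in range(k):
        denom = sum(
            (wins[a][b] + wins[b][a]) / (strength[a] + strength[b])
            for b in range(k)
            if b != a
        )
        new.append(sum(wins[a]) / denom)
    # Rescale so the last team (Baltimore) has strength 1, that is, beta_5 = 0.
    strength = [s / new[-1] for s in new]

beta = [math.log(s) for s in strength]


def prob(a, b):
    """Fitted probability that team a beats team b."""
    return strength[a] / (strength[a] + strength[b])


# Deviance comparing the 10 binomial samples with the model fit.
g2 = 0.0
for a in range(k):
    for b in range(a + 1, k):
        games = wins[a][b] + wins[b][a]
        for won, p in ((wins[a][b], prob(a, b)), (wins[b][a], prob(b, a))):
            if won > 0:
                g2 += 2 * won * math.log(won / (games * p))

for name, b_hat in zip(teams, beta):
    print(f"{name:10s} {b_hat:6.3f}")
print(f"P(Boston beats New York) = {prob(0, 1):.3f}, G2 = {g2:.2f}")
```

This route gives point estimates and the deviance only; the standard errors in Table 11.11 come from the information matrix of the logistic fit.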


11.6.3  Example:  Home  Team  Advantage  in  Baseball 

The analysis above does not recognize which team is the home team. Most sports have
a home field advantage: A team is more likely to win when it plays at its home city.
Table 11.12 contains results for 2011 according to the (home team, away team)
classification. For instance, when Boston was the home team, it beat New York 5 times and lost 4
times; when New York was the home team, it beat Boston 2 times and lost 7 times. Now
Table 11.12  Wins/Losses by Home Team and Away Team, for American League (Eastern
Division) Baseball Teams in 2011 Season

                                   Away Team
Home Team     Boston    New York    Tampa Bay    Toronto    Baltimore
Boston           -         5-4         2-7         6-3         6-3
New York        2-7         -          6-3         7-2         7-2
Tampa Bay       5-4        6-3          -          6-3         3-6
Toronto         2-7        6-3         5-4          -          6-3
Baltimore       5-4        3-6         3-6         3-6          -

Source: www.baseball-reference.com/games.


for all $a \ne b$, let $\Pi^*_{ab}$ denote the probability that team $a$ beats team $b$, when $a$ is the home
team. Consider logistic model

$$\log\frac{\Pi^*_{ab}}{1 - \Pi^*_{ab}} = \alpha + (\beta_a - \beta_b). \qquad (11.31)$$

When $\alpha > 0$, a home field advantage exists.

For Table 11.12, model (11.31) describes 20 binomial distributions with 5 parameters.
It has $G^2 = 19.41$ (df = 15). The estimate of the home-field parameter is $\hat\alpha = 0.080$ with
SE = 0.154. For two evenly matched teams, the home team had estimated probability
$e^{0.080}/[1 + e^{0.080}] = 0.521$ of winning. In fact, of the 180 games between teams in this
division in 2011, the home team won 94, or 52.2%.

For this model, the estimated team effects are $\hat\beta_1 = 0.453$, $\hat\beta_2 = 0.498$, $\hat\beta_3 = 0.482$,
$\hat\beta_4 = 0.379$, and $\hat\beta_5 = 0.000$, with SE values of about 0.30 for differences. When Boston
played New York, the estimated probability of a Boston win was 0.509 at Boston and 0.469
at New York. Perhaps surprisingly, the simpler model that has $\alpha = \beta_1 = \beta_2 = \beta_3 = \beta_4 =
\beta_5 = 0$, corresponding to each game having result comparable to flipping a fair coin, has
$G^2 = 23.54$ (df = 20) and does not give a significantly poorer fit (change in deviance =
4.13, df = 5, P = 0.53).
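The two Boston versus New York probabilities follow directly from model (11.31) and the reported estimates. A minimal sketch, in which the helper names `expit` and `home_win_prob` are ours:

```python
import math

# Reported estimates for model (11.31); Baltimore is the baseline team.
alpha_hat = 0.080
beta_hat = {
    "Boston": 0.453,
    "New York": 0.498,
    "Tampa Bay": 0.482,
    "Toronto": 0.379,
    "Baltimore": 0.000,
}


def expit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def home_win_prob(home, away):
    """Estimated probability that the home team beats the away team."""
    return expit(alpha_hat + beta_hat[home] - beta_hat[away])


# A Boston win at Boston is a home win; a Boston win at New York is a home loss.
p_boston_at_boston = home_win_prob("Boston", "New York")
p_boston_at_new_york = 1.0 - home_win_prob("New York", "Boston")
print(f"{p_boston_at_boston:.3f}  {p_boston_at_new_york:.3f}")
```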

Model (11.31) is a useful generalization of the Bradley-Terry model whenever an order
effect  exists.  For  instance,  in  pairwise  taste  evaluations,  the  product  tasted  first  may  have  a 
slight  advantage. 

11.6.4  Bradley-Terry  Model  and  Quasi-symmetry 

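Fienberg and Larntz (1976) showed that the Bradley-Terry model is a logistic formulation
of the quasi-symmetry model (11.19). For quasi-symmetry, given that an observation is in
cell $(a, b)$ or $(b, a)$, the logit of the conditional probability of cell $(a, b)$ equals

$$\log\frac{\mu_{ab}}{\mu_{ba}} = (\lambda + \lambda^X_a + \lambda^Y_b + \lambda_{ab}) - (\lambda + \lambda^X_b + \lambda^Y_a + \lambda_{ba})
= (\lambda^X_a - \lambda^Y_a) - (\lambda^X_b - \lambda^Y_b) = \beta_a - \beta_b,$$

where $\beta_a = \lambda^X_a - \lambda^Y_a$. Estimates $\{\hat\lambda^X_a\}$ and $\{\hat\lambda^Y_a\}$ for quasi-symmetry yield $\{\hat\beta_a\}$ for the
Bradley-Terry model, and recall that we can constrain all $\lambda^Y_a = 0$ for quasi-symmetry.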
11.6.5  Extensions to Ties and Ordinal Pairwise Evaluations

The Bradley-Terry model extends to ordinal comparisons, such as the evaluation scale
(much better, slightly better, the same, slightly worse, much worse) in comparing two
products. With cumulative logits and an $I$-category evaluation scale, let $Y_{ab}$ denote the
response for a comparison of $a$ with $b$. The model is

$$\text{logit}[P(Y_{ab} \le j)] = \alpha_j + (\beta_a - \beta_b).$$

Since $P(Y_{ab} \le j) = P(Y_{ba} > I - j) = 1 - P(Y_{ba} \le I - j)$, it follows that
$\text{logit}[P(Y_{ab} \le j)] = -\text{logit}[P(Y_{ba} \le I - j)]$. Thus, necessarily, $\alpha_j = -\alpha_{I-j}$.

The most common ordered preference scale is (win, tie, lose). Then, $\alpha_1 = -\alpha_2$.


11.7  MARGINAL MODELS AND QUASI-SYMMETRY MODELS
FOR  MATCHED  SETS 

Methods  for  matched  pairs  extend  to  matched  sets.  Here  we  present  mainly  the  loglinear 
modeling  approach,  with  a  brief  discussion  of  marginal  models. 


11.7.1  Marginal  Homogeneity,  Complete  Symmetry,  and  Quasi-symmetry 

Let $(Y_1, Y_2, \ldots, Y_T)$ denote the $T$ responses for a randomly selected matched set. With $I$
response categories, a contingency table with $I^T$ cells summarizes the possible outcomes.
Let $\mathbf{j} = (j_1, \ldots, j_T)$ denote the cell having $Y_t = j_t$, $t = 1, \ldots, T$. Let $\pi_{\mathbf{j}} = P(Y_t = j_t,
t = 1, \ldots, T)$. Then

$$P(Y_t = j) = \pi_{+\cdots+j+\cdots+},$$

where the $j$ subscript is in position $t$, and $\{P(Y_t = j),\ j = 1, \ldots, I\}$ is the marginal
distribution for $Y_t$.

This $T$-way table satisfies marginal homogeneity if

$$P(Y_1 = j) = P(Y_2 = j) = \cdots = P(Y_T = j) \quad \text{for } j = 1, \ldots, I - 1.$$

It satisfies complete symmetry if

$$\pi_{\mathbf{j}} = \pi_{\mathbf{k}}$$

for any permutation $\mathbf{k} = (k_1, \ldots, k_T)$ of $\mathbf{j} = (j_1, \ldots, j_T)$. Complete symmetry implies
marginal homogeneity, but the converse does not hold except when $T = I = 2$.

Complete symmetry is a loglinear model. For $\mu_{\mathbf{j}} = n\pi_{\mathbf{j}}$, one representation is

$$\log \mu_{\mathbf{j}} = \lambda_{ab\ldots m},$$

where $a$ is the minimum of $(j_1, \ldots, j_T)$, $b$ is the next smallest, ..., and $m$ is the maximum.
In a three-way table, for instance, $\log \mu_{122} = \log \mu_{212} = \log \mu_{221} = \lambda_{122}$. The number of
$\{\lambda_{ab\ldots m}\}$ parameters is the number of ways of selecting $T$ out of $I$ items with replacement,
which is $\binom{I+T-1}{T}$. Thus, residual df = $I^T - (I + T - 1)!/[T!(I - 1)!]$.

An $I^T$ table satisfies quasi-symmetry if

$$\log \mu_{\mathbf{j}} = \lambda_{1j_1} + \lambda_{2j_2} + \cdots + \lambda_{Tj_T} + \lambda_{ab\ldots m}, \qquad (11.32)$$

where $\lambda_{ab\ldots m}$ is defined as in the complete symmetry model. It has symmetric association
and higher-order interaction terms, but permits each single-factor marginal distribution to
have its own parameters. Identifiability requires constraints such as $\lambda_{tI} = 0$ for each $t$ and
equating one set of main-effect terms to 0. This model has $(I - 1)(T - 1)$ more parameters
than complete symmetry.

When quasi-symmetry (11.32) holds, marginal homogeneity is equivalent to complete
symmetry. The statistic

$$G^2(S \mid QS) = G^2(S) - G^2(QS)$$

tests marginal homogeneity. Under complete symmetry, it has large-sample chi-squared
distribution with df = $(I - 1)(T - 1)$.
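The degrees-of-freedom counts in this subsection are easy to mis-tally by hand, so a small sketch may be useful. The function names are hypothetical, and the counts assume that no empty cells have been eliminated from the fit.

```python
from math import comb


def df_complete_symmetry(I, T):
    """Residual df for complete symmetry in an I^T table."""
    # comb(I + T - 1, T) counts multisets of size T drawn from I categories,
    # which is the number of lambda_{ab...m} parameters.
    return I ** T - comb(I + T - 1, T)


def df_quasi_symmetry(I, T):
    """Residual df for quasi-symmetry: (I - 1)(T - 1) fewer than complete symmetry."""
    return df_complete_symmetry(I, T) - (I - 1) * (T - 1)


# A 2^3 table (three binary responses) and a 4 x 4 matched-pairs table.
print(df_complete_symmetry(2, 3), df_quasi_symmetry(2, 3))
print(df_complete_symmetry(4, 2), df_quasi_symmetry(4, 2))
```

The difference between the two functions is the df of the test of marginal homogeneity based on $G^2(S \mid QS)$.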


11.7.2  Types  of  Marginal  Symmetry 

A general type of symmetry for $I^T$ tables has marginal homogeneity and complete symmetry
as special cases. For an $I^T$ table, $P(Y_{t_1} = j_1, \ldots, Y_{t_h} = j_h)$, where $h$ is between 1 and $T$, is
an $h$-dimensional marginal probability, $h = 1$ giving single-variable marginal probabilities.
There is $h$th-order marginal symmetry if for all $h$-tuples $\mathbf{j} = (j_1, \ldots, j_h)$, this probability
is the same for each permutation of $\mathbf{j}$ and for all combinations $\mathbf{t} = (t_1, \ldots, t_h)$ of $h$ of the
$T$ responses.

For $h = 1$, first-order marginal symmetry is marginal homogeneity. Second-order
marginal symmetry occurs if for all $t$ and $u$, $P(Y_t = a, Y_u = b)$ is the same and the equality
holds for all pairs of outcomes $(a, b)$. In other words, the two-way marginal tables exhibit
symmetry, and they are identical. $T$th-order marginal symmetry in an $I^T$ table is complete
symmetry. When $h$th-order symmetry holds, $i$th-order marginal symmetry holds for any
$i < h$. For instance, complete symmetry implies second-order marginal symmetry, which
itself implies marginal homogeneity.

This  hierarchy  is  mathematically  attractive.  However,  the  higher-order  symmetries  are 
usually  too  restrictive  to  fit  well  in  practice. 


11.7.3  Comparing Binary Marginal Distributions in Multiway Tables

Usually,  the  multivariate  dependence  structure  among  repeated  responses  is  of  less  interest 
than  their  marginal  distributions.  For  instance,  in  treating  a  chronic  condition  (such  as 
migraine  headaches  or  a  phobia)  with  some  treatment,  the  primary  goal  might  be  to  study 
whether  the  probability  of  success  increases  over  the  T  weeks  of  a  treatment  period.  The 
$T$ success probabilities refer to the $T$ first-order marginal distributions. In Sections 11.2.1
and 11.3 we compared marginal distributions for matched pairs ($T = 2$) using models that
apply directly to the marginal distributions. We now extend this approach to $T > 2$.
For the binary case, the marginal logistic model (11.6) for matched pairs extends to

$$\text{logit}[P(Y_t = 1)] = \alpha + \beta_t, \quad t = 1, \ldots, T, \qquad (11.33)$$

with a constraint such as fixing one $\beta_t$ at 0 or setting $\alpha = 0$. This model is saturated, describing $T$ marginal
probabilities by $T$ parameters. Marginal homogeneity, for which $P(Y_1 = 1) = \cdots =
P(Y_T = 1)$, is the special case $\beta_1 = \cdots = \beta_T$. Even though this case has only one
parameter, ML fitting is not simple. Let $\boldsymbol{\pi}$ denote the vector of the probabilities $\pi_{\mathbf{i}}$ for the
possible $\mathbf{i}$. They specify the joint distribution of $(Y_1, \ldots, Y_T)$ for the $2^T$ table that
cross-classifies the $T$ responses. The multinomial likelihood refers to $\boldsymbol{\pi}$ rather than the $T$ marginal
probabilities $\{P(Y_t = 1)\}$. Fitting methods are described in Section 12.1.4.

Let $n_{\mathbf{i}}$ denote the sample cell count in cell $\mathbf{i}$. The kernel of the log likelihood $L(\boldsymbol{\pi})$
is $\sum_{\mathbf{i}} n_{\mathbf{i}} \log \pi_{\mathbf{i}}$. Let $L(\mathbf{p})$ denote the log likelihood evaluated at the sample proportions
$\{p_{\mathbf{i}} = n_{\mathbf{i}}/n\}$, the ML fit of model (11.33). Let $L(\hat{\boldsymbol{\pi}}_{MH})$ denote the maximized log likelihood
assuming marginal homogeneity. The likelihood-ratio test of marginal homogeneity (Lipsitz
et al. 1990, Madansky 1963) uses

$$-2[L(\hat{\boldsymbol{\pi}}_{MH}) - L(\mathbf{p})] = 2\sum_{\mathbf{i}} n_{\mathbf{i}} \log(p_{\mathbf{i}}/\hat\pi_{\mathbf{i},MH}). \qquad (11.34)$$

The asymptotic null chi-squared distribution has df = $T - 1$.

11.7.4  Example:  Attitudes  Toward  Legalized  Abortion 

For the GSS data in Table 11.13, subjects indicated whether they support legalized abortion
in each of three situations. For this $2^3$ table, let $\mu_{hij}$ denote the expected frequency for
response sequence $(h, i, j)$ for the three situations. Consider the model

$$\log \mu_{hij} = \lambda_{abc},$$

with interaction term $\lambda_{111}$ when $(h, i, j) = (1,1,1)$, $\lambda_{112}$ when $(h, i, j) = (1,1,2)$ or (1,2,1) or
(2,1,1), $\lambda_{122}$ when $(h, i, j) = (1,2,2)$ or (2,1,2) or (2,2,1), and $\lambda_{222}$ when $(h, i, j) = (2,2,2)$.
This model implies a complete symmetry pattern of probabilities. Its fit has $G^2 = 965.94$
with df = 4. The lack of fit is not surprising, as, for instance, $n_{212} = 1$ whereas $n_{221} = 423$.
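Because complete symmetry equates the expected frequencies of all cells that are permutations of one another, its ML fitted value in each cell is simply the average count over that set of cells. The sketch below uses this fact to reproduce the fit statistic from the counts of Table 11.13; the variable names are ours.

```python
import math

# Counts from Table 11.13, keyed by response sequence (item 1, item 2, item 3);
# 1 = yes, 2 = no.
counts = {
    (1, 1, 1): 466, (1, 1, 2): 3, (2, 1, 1): 39, (2, 1, 2): 1,
    (1, 2, 1): 71, (1, 2, 2): 3, (2, 2, 1): 423, (2, 2, 2): 147,
}

# Cells that are permutations of each other share one lambda parameter,
# identified here by the sorted sequence.
groups = {}
for cell, n_cell in counts.items():
    groups.setdefault(tuple(sorted(cell)), []).append(n_cell)

# ML fitted value under complete symmetry: the average count within each group.
fitted = {
    cell: sum(groups[tuple(sorted(cell))]) / len(groups[tuple(sorted(cell))])
    for cell in counts
}

g2 = 2 * sum(
    n_cell * math.log(n_cell / fitted[cell])
    for cell, n_cell in counts.items()
    if n_cell > 0
)
df = len(counts) - len(groups)
print(f"G2 = {g2:.2f}, df = {df}")
```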

Adding  main-effect  terms  for  the  three  situations  provides  the  quasi-symmetry  model. 
It  fits  much  better,  having  G2  =  8.30  with  df  =  2.  Table  11.13  shows  its  fitted  values.  Its 
estimated  main-effect  terms  (—4.60,  —5.18,  0)  show  greater  support  for  legalized  abortion 


Table 11.13  Support for Legalizing Abortion in Three Situations

                         Sequence of Responses on the Three Items^a
           (1,1,1)  (1,1,2)  (2,1,1)  (2,1,2)  (1,2,1)  (1,2,2)  (2,2,1)  (2,2,2)
Counts       466       3       39        1       71        3      423      147
QS fit       466      0.4     40.2      2.4     72.4      4.2    420.4     147

^a Items (1) family has very low income and cannot afford more children, (2) woman is not married and does
not want to marry the man, (3) woman's health is seriously endangered by the pregnancy. 1, yes; 2, no.
Source: Data from 2010 General Social Survey.
when a woman's health is seriously endangered than in the other two cases. The slight
lack of fit reflects that the joint distribution has departures from a symmetric association
structure. For example, the loglinear model with only two-factor association terms, which
has $G^2 = 0.33$ with df = 1, has fitted log odds ratios of 4.30 for situations 1 and 2, 1.95 for
situations 1 and 3, and 2.20 for situations 2 and 3. There is a stronger conditional association
between opinions on the two non-health-related situations.

We  can  test  marginal  homogeneity  by  the  likelihood-ratio  statistic  comparing  symmetry 
and  quasi-symmetry  models,  965.94  —  8.30  =  957.64,  with  df  =  2.  The  likelihood-ratio 
statistic (11.34) for directly testing marginal homogeneity is 647.54, with df = 2. Either
statistic gives extremely strong evidence of marginal heterogeneity. For simultaneous
confidence intervals comparing proportions in support of legalization for pairs of situations,
with overall error probability ≤ 0.05, the Bonferroni method uses confidence coefficient
(1 - 0.05/3) = 0.9833 for each. From formula (11.1), the estimate 0.029 = 0.471 - 0.441
of  the  difference  between  situations  (1)  and  (2)  has  an  estimated  standard  error  of  0.0092. 
The  Wald  confidence  interval  for  the  true  difference  is  0.029  ±  2.39(0.0092),  or  (0.01, 
0.05).  The  intervals  are  (0.36,  0.43)  comparing  (3)  and  (1)  and  (0.39,  0.46)  comparing  (3) 
and  (2).  We  infer  that  there  are  differences  of  opinion  between  each  pair  of  situations,  with 
very  large  differences  between  health  (3)  and  the  other  two. 
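These simultaneous intervals can be recomputed from the counts in Table 11.13. The sketch below assumes the usual large-sample variance for a difference of dependent proportions, $[(p_{12} + p_{21}) - (p_{12} - p_{21})^2]/n$, which is what we take formula (11.1) to provide; the function name `compare` is hypothetical.

```python
import math
from statistics import NormalDist

# Counts from Table 11.13, keyed by response sequence (item 1, item 2, item 3);
# 1 = yes, 2 = no.
counts = {
    (1, 1, 1): 466, (1, 1, 2): 3, (2, 1, 1): 39, (2, 1, 2): 1,
    (1, 2, 1): 71, (1, 2, 2): 3, (2, 2, 1): 423, (2, 2, 2): 147,
}
n = sum(counts.values())

# Bonferroni: three two-sided intervals with overall error probability 0.05.
z = NormalDist().inv_cdf(1 - 0.05 / (2 * 3))


def compare(s, t):
    """Estimate, SE, and interval for P(Y_s = yes) - P(Y_t = yes); s, t are 0-based."""
    n_st = sum(c for cell, c in counts.items() if cell[s] == 1 and cell[t] == 2)
    n_ts = sum(c for cell, c in counts.items() if cell[s] == 2 and cell[t] == 1)
    p_st, p_ts = n_st / n, n_ts / n
    diff = p_st - p_ts
    se = math.sqrt(((p_st + p_ts) - (p_st - p_ts) ** 2) / n)
    return diff, se, (diff - z * se, diff + z * se)


for label, (s, t) in {"(1) vs (2)": (0, 1), "(3) vs (1)": (2, 0), "(3) vs (2)": (2, 1)}.items():
    diff, se, (lo, hi) = compare(s, t)
    print(f"{label}: diff = {diff:.3f}, SE = {se:.4f}, interval = ({lo:.2f}, {hi:.2f})")
```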

11.7.5  Marginal  Homogeneity  for  a  Multicategory  Response 

The binary marginal model (11.33) extends to multinomial responses. With
baseline-category logits for $I$ outcome categories, the saturated model is

$$\log[P(Y_t = j)/P(Y_t = I)] = \beta_{tj}, \quad t = 1, \ldots, T, \quad j = 1, \ldots, I - 1. \qquad (11.35)$$

Marginal homogeneity, whereby $P(Y_1 = j) = \cdots = P(Y_T = j)$ for $j = 1, \ldots, I - 1$, is
the special case in which

$$\beta_{1j} = \beta_{2j} = \cdots = \beta_{Tj}, \quad j = 1, \ldots, I - 1.$$

The likelihood-ratio test of marginal homogeneity comparing the two models has form
(11.34) and df = $(T - 1)(I - 1)$.

For an ordinal response, an unsaturated model that is more complex than marginal
homogeneity focuses on location shifts among the $T$ margins. One such model is

$$\text{logit}[P(Y_t \le j)] = \alpha_j + \beta_t, \quad t = 1, \ldots, T, \quad j = 1, \ldots, I - 1, \qquad (11.36)$$

with a constraint such as fixing one $\beta_t$ at 0. This model can be fitted using ML methodology presented
in Section 12.1.4. Marginal homogeneity is the special case $\beta_1 = \cdots = \beta_T$, tests of which
have df = $T - 1$.

11.7.6  Wald  and  Generalized  CMH  Score  Tests  of  Marginal  Homogeneity 

Let $p_j(t)$ denote the sample proportion in category $j$ for response $Y_t$, let

$$\bar p_j = \sum_t p_j(t)/T, \qquad d_j(t) = p_j(t) - \bar p_j,$$

and let $\mathbf{d}$ denote the vector of $\{d_j(t),\ t = 1, \ldots, T - 1,\ j = 1, \ldots, I - 1\}$. Let $\hat{\mathbf{V}}$ denote
the estimated covariance matrix of $\sqrt{n}\,\mathbf{d}$. Bhapkar (1973) proposed the Wald statistic

$$W = n\,\mathbf{d}^T \hat{\mathbf{V}}^{-1} \mathbf{d} \qquad (11.37)$$

for the general alternative. This generalizes (11.15) and has a large-sample chi-squared
distribution with df = $(I - 1)(T - 1)$.

Other statistics are special cases of the generalized multicategory Cochran-
Mantel-Haenszel (CMH) statistic of Section 8.4.3. Recall that for the binary case ($I = 2$)
with matched pairs ($T = 2$), the CMH statistic applies to a three-way table (see, e.g., Table
11.2) in which each stratum shows the two outcomes for a given subject. A generalization
of Table 11.2 provides $n$ strata of $T \times I$ tables. The $k$th stratum gives the $T$ outcomes for
subject $k$. Row $t$ in a stratum has a 1 in the column that is the outcome for observation $t$, and
0 in all other columns (or 0 in every column if that observation is missing). Probability
distributions for the subject-stratified setup naturally relate to subject-specific models such as
logistic model (11.23), rather than to marginal models. However, conditional independence
in this three-way table (given subject) corresponds to an exchangeability among variables
in the $I^T$ table that implies marginal homogeneity. A generalized CMH test of conditional
independence in the $T \times I \times n$ table also tests marginal homogeneity using a sampling
distribution generated under the stronger exchangeability condition (Darroch 1981). For an
ordinal response with fixed scores, the generalized CMH statistic for detecting variability
among $T$ means is appropriate.

When $I = 2$ and $T = 2$, this CMH approach is equivalent to McNemar's statistic. When
$I = 2$ but $T > 2$, the generalized CMH statistic treating the $T$ responses as unordered is
identical to a statistic proposed by Cochran (1950; see Exercise 11.49), often referred to as
Cochran's Q.
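
The equivalence for $T = 2$ can be checked numerically. The sketch below is illustrative only: `cochran_q` is a hypothetical helper that uses the standard count form of Cochran's Q (algebraically the same as the form given in Exercise 11.49), and the inputs are the counts of Table 11.22 ($T = 3$) and Table 11.14 ($T = 2$) from the exercises.

```python
def cochran_q(patterns):
    """Cochran's Q from a dict mapping a tuple of T binary responses to its count.

    Uses Q = (T-1) [T * sum_t C_t^2 - N^2] / [T * N - sum_i R_i^2], where C_t are
    the column (occasion) totals, R_i the subject totals, and N the grand total.
    """
    T = len(next(iter(patterns)))
    col = [sum(cnt * y[t] for y, cnt in patterns.items()) for t in range(T)]
    total = sum(col)
    row_sq = sum(cnt * sum(y) ** 2 for y, cnt in patterns.items())
    return (T - 1) * (T * sum(c * c for c in col) - total ** 2) / (T * total - row_sq)


# Table 11.22: response patterns (A, B, C) with 1 = favorable.
crossover = {(1, 1, 1): 6, (1, 0, 1): 2, (0, 1, 1): 2, (0, 0, 1): 6,
             (1, 1, 0): 16, (1, 0, 0): 4, (0, 1, 0): 4, (0, 0, 0): 6}
q_crossover = cochran_q(crossover)  # refer to chi-squared with df = T - 1 = 2

# Table 11.14: (heaven, hell) with 1 = yes; with T = 2, Q should equal McNemar.
heaven_hell = {(1, 1): 955, (1, 0): 162, (0, 1): 9, (0, 0): 188}
q_pairs = cochran_q(heaven_hell)
mcnemar = (162 - 9) ** 2 / (162 + 9)
print(q_crossover, q_pairs, mcnemar)
```

Subjects with identical responses on all occasions contribute nothing to either the numerator or the denominator, which is the point of part (c) of Exercise 11.49.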

In  the  next  chapter  we  present  marginal  models  in  more  general  contexts.  We  extend  the 
analyses  of  this  chapter  to  incorporate  matched  sets  with  explanatory  variables. 


NOTES 

Section  11.1:  Comparing  Dependent  Proportions 

11.1 McNemar generalized: Altham (2010), Copas (1973), Gart (1969), Kenward and Jones
(1994), and Miettinen (1969) considered generalizations of matched-pairs designs. Altham
(2010) discussed alternatives to McNemar's test when each response is missing some
observations. Miettinen generalized the McNemar test to case-control sets having several controls
per case. The Table 11.2 representation is then useful. Each of n matched sets forms a
stratum of a 2 x 2 x n table with one observation in column 1 (the case) and several
observations in column 2 (the controls). Westfall et al. (2010) proposed multiple comparison
methods for multiple McNemar tests with dependent or independent samples. Lloyd (2008)
surveyed exact methods and proposed an unconditional method based on maximizing the
P-value when using an estimate of the nuisance parameter. Altham (1971), Consonni and La
Rocca (2008), and Ghosh et al. (2000) presented Bayesian analyses for binary matched pairs.
For some of these and some unconditional approaches, inferences about marginal
homogeneity also use the main-diagonal observations (Liang and Zeger 1988, Suissa and Shuster
1991).


Section 11.4: Symmetry, Quasi-symmetry, and Quasi-independence

11.2 Quasi-symmetry: For other discussion of quasi-symmetry (QS), see Darroch (1981) and
McCullagh (1982). It contains as a special case other useful models, such as the ones in Sections
11.4.4 and 11.6.4. Kateri and Papaioannou (1997) showed that under certain conditions QS
is the closest model to complete symmetry in terms of Kullback-Leibler distance. Gottard
et al. (2011) proposed graphical models with colored edges to represent QS structure among
multivariate responses having some common categorical scales. Results at the end of Section
11.4.2 relating conditional ML to QS extend to multiple occasions using a multivariate form
(11.32) of QS (Agresti 1997, Bhapkar and Darroch 1990, Conaway 1989, Darroch 1981,
Tjur 1982; see also Section 14.2.7). The effect in ordinal QS (11.27) relates to the main
effect in a subject-specific adjacent-categories-logit model (Agresti 1993). The symmetry
model generalizes in other ways for ordinal responses, such as with the conditional symmetry
model

$$\log(\pi_{ab}/\pi_{ba}) = \tau, \quad a < b$$

(Bishop et al. 1975, pp. 285-286), which implies that for all $a < b$,

$$P(Y_1 = a, Y_2 = b \mid Y_1 < Y_2) = P(Y_1 = b, Y_2 = a \mid Y_1 > Y_2).$$

For other generalizations, see Agresti (2010, Sec. 8.3), Goodman (1979b, 1985), Hout et al.
(1987), and Kateri and Agresti (2007).

11.3 Quasi-independence: The term quasi-independence originated in Goodman (1968). A more
general definition of it is $\pi_{ab} = \alpha_a \beta_b$ for some fixed set of cells. See Caussinus (1966),
Fienberg (1970b, 1972), and Goodman (1968). Caussinus used the concept to analyze tables
that deleted a certain set of cells from consideration, and Goodman used it in earlier analyses
of social mobility. Stigler (1999, Chap. 19) summarized early uses, including Karl Pearson's
handling in 1913 of a triangular array. Booth and Butler (1999), Krampe et al. (2011), and
Smith et al. (1996) presented exact tests for square-table models.

11.4 Occupational mobility: Models for square tables are applied often to the study of occupational
(or social) mobility. Each observation pairs parent's occupation with child's occupation. See
Goodman (1979b), Hout et al. (1987), Sobel et al. (1998), and Xie (1992).

11.5 Upper-triangular tables: In some applications a table is a priori symmetric or independent,
but only the pair (i, j) rather than their order is observable, thus leading to an upper-triangular
table (Altham 1975). See Khamis (1983) for examples and ML fitting of models for such
three-way tables that are symmetric within layers.


Section  11.5:  Measuring  Agreement  Between  Observers 

11.6 Kappa, intraclass correlation, QS: Banerjee et al. (1999) and Fleiss et al. (2003, Chap.
18) reviewed kappa and its generalizations. See also Kraemer et al. (2002), who discussed
versions of kappa and its generalizations that should or should not be used, and Landis et al.
(2011), who described a concordance index that has kappa measures as special cases. Kappa
and weighted kappa relate to the intraclass correlation, a measure of interrater reliability
for interval scales (Fleiss 1981; Fleiss and Cohen 1973; Kraemer 1979; Landis et al. 2011).
Agresti (2010, Sec. 8.5.3), Becker and Agresti (1992), Goodman (1979b), and Tanner and
Young (1985) modeled agreement using loglinear models. Darroch and McCloud (1986)
showed that quasi-symmetry has an important role in agreement modeling.


Section 11.6: Bradley-Terry Model for Paired Preferences

11.7 Bradley-Terry generalized: Zermelo (1929) proposed a model that is equivalent to the
Bradley-Terry model. Luce (1959) provided an axiomatic basis for it. Mosteller (1951) and
Thurstone (1927) proposed an analogous model with the probit link. An interview of Ralph
Bradley by M. Hollander (Statist. Sci. 16: 75-100, 2001) discussed food-tasting applications
that motivated its development. For extensions, see Bradley (1976), David (1988), and Imrey
(2005). Fienberg and Larntz (1976) and Imrey et al. (1976) related it to quasi-independence.
Dudbridge (2007) suggested it in genetics for modeling pairs of transmitted/nontransmitted
alleles for multiallelic markers. Dittrich et al. (1998) allowed covariates. Matthews and Morris
(1995) gave an application with a factorial design, ties, and allowance for dependence among
judgments. Böckenholt and Dillon (1997) and Dittrich et al. (2007) modeled dependence with
ordinal preferences.


Section  11.7:  Marginal  Models  and  Quasi-symmetry  Models  for  Matched  Sets 

11.8 MH/CMH: Darroch (1981) surveyed thoroughly the relationships among statistics for testing
marginal homogeneity and their connections with generalized CMH analyses. See also Mantel
and Byar (1978) and White et al. (1982). Bergsma et al. (2009) considered several hypotheses
for longitudinal data in the context of the generalized loglinear model (10.10).


EXERCISES 

Applications 

11.1 A poll by Louis Harris and Associates of 1249 adult Americans indicated that 36%
believe in ghosts and 37% believe in astrology. Can you compare these proportions
inferentially? If yes, do so. If not, explain what further information you need.

11.2 Refer to Table 11.1 for the 2004 and 2008 Presidential elections. The corresponding
2010 GSS sample for the 585 females had counts 266 and 13 in row 1 and 94 and
212 in row 2. Conduct all steps of McNemar's test, and interpret.

11.3  Table  11.14  shows  data  about  belief  in  heaven  and  belief  in  hell. 

a.  Compare  the  marginal  proportions  using  a  95%  confidence  interval. 

b.  Perform  McNemar’s  test,  and  interpret  results. 

c. Explain how these data suggest that the marginal proportions are strongly
dependent, rather than independent. Explain why inferences in (a) are more precise
than if we had the same sample proportions but with independent samples of
size 1314 each.


Table 11.14 Belief in Heaven and Belief in Hell, for Exercise 11.3

                        Belief in Hell
Belief in Heaven      Yes      No    Total
Yes                   955     162     1117
No                      9     188      197
Total                 964     350     1314

Source: 2008 General Social Survey.

11.4  In  the  2006  GSS,  subjects  were  asked  their  opinion  about  whether  the  federal 
government  should  fund  stem  cell  research.  In  2001,  President  George  W.  Bush 
had  instituted  a  policy  that  barred  the  NIH  from  funding  research  on  embryonic 
stem  cells  beyond  using  the  existing  cell  lines.  For  those  who  responded  that  the 
government  should  definitely  fund  such  research,  was  there  a  change  between  2000 
and  2004  in  the  relative  numbers  voting  for  Bush?  For  the  152  GSS  subjects  in  2006 
who  voted  Democrat  or  Republican  in  each  election  and  who  supported  funding 
stem  cell  research,  89  voted  Democrat  each  time,  52  voted  Republican  each  time,  7 
voted  Democrat  in  2000  and  Republican  in  2004,  and  4  voted  Republican  in  2000 
and  Democrat  in  2004.  Compare  the  marginal  proportions  using  a  95%  confidence 
interval,  and  perform  the  small-sample  analog  of  McNemar’s  test.  Interpret. 
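
A small-sample analog of McNemar's test conditions on the number of discordant pairs and refers one off-diagonal count to a binomial distribution with success probability 1/2. The sketch below is illustrative only; `exact_mcnemar_p` is a hypothetical helper name, and doubling the smaller tail is one common convention for the two-sided P-value (here it coincides with other conventions because the null binomial is symmetric). The counts are the discordant pairs stated in Exercise 11.4.

```python
from math import comb


def exact_mcnemar_p(n12, n21):
    """Two-sided exact P-value; under H0, n12 ~ binomial(n12 + n21, 1/2)."""
    m = n12 + n21
    k = min(n12, n21)
    tail = sum(comb(m, x) for x in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)


# Exercise 11.4: 7 switched Democrat -> Republican, 4 switched Republican -> Democrat.
p_value = exact_mcnemar_p(7, 4)
print(p_value)
```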

11.5 Refer to Table 9.16 and Exercise 9.1. Treat the data as matched pairs on opinion,
stratified by gender. Testing independence for the 2 x 2 table using entries (6, 160)
in row 1 and (11, 181) in row 2 tests equality of $\beta$ for logistic model (11.8) for each
gender. Explain why.

11.6 A crossover experiment with 100 subjects compares two drugs for treating migraine
headaches. The response scale is success (1) or failure (0). Half the study subjects,
randomly selected, used drug A the first time they had a headache and drug B the
next time. For them, 6 responded (1,1) for (A,B), 25 responded (1,0), 10 responded
(0,1), and 9 responded (0,0). For the 50 subjects who took the drugs in the reverse
order, 10 were (1,1) for (A,B), 20 were (1,0), 12 were (0,1), and 8 were (0,0).

a.  Ignoring  treatment  order,  compare  the  success  probabilities  for  the  two  drugs. 
Interpret. 

b.  McNemar’s  test  uses  only  the  pairs  of  outcomes  that  differ.  For  this  study,  Table 
11.15  shows  such  data  from  both  treatment  orders.  Testing  independence  for 
this  table  tests  whether  success  rates  are  identical  for  the  treatments  (Gart  1969). 
Explain  why.  Analyze  these  data,  and  interpret. 


Table 11.15 Data for Exercise 11.6

                     Treatment That Is Better
Treatment Order        First      Second
A, then B                25          10
B, then A                12          20

11.7 A case-control study has 8 pairs of subjects. The cases have colon cancer, and the
controls are matched with the cases on gender and age. A possible explanatory
variable is the extent of red meat in a subject's diet, measured as "1 = high" or
"0 = low." The (case, control) observations on this were (1,1) for 3 pairs, (0,0) for
1 pair, (1,0) for 3 pairs, and (0,1) for 1 pair.

a. Cross-classify the 8 pairs in terms of diet (1 or 0) for the case against diet (1
or 0) for the control. Call this Table A. Display the 2 x 2 x 8 table with eight
partial tables relating diet (1 or 0) to response (case or control) for the 8 pairs.
Call this Table B.

b. Calculate the McNemar $z^2$ for Table A and the CMH statistic for Table B.
Compare.

c. Show that the Mantel-Haenszel estimate (6.7) of a common odds ratio for
Table B is identical to $n_{12}/n_{21}$ for Table A.

d. For Table B with pairs deleted in which the case and the control had the same
diet, show that the CMH statistic and the Mantel-Haenszel odds ratio estimate
do not change.

e. This sample size is small for large-sample tests. Use the binomial distribution
with Table A to find the exact two-sided P-value.


11.8 For Table 11.14 above, find the estimated odds ratio for comparing the distributions
for belief in heaven and for belief in hell, using (a) model (11.6) with a population-
averaged effect and (b) model (11.8) with a subject-specific effect. (c) Explain why
they differ.

11.9  Table  11.16  shows  subjects’  purchase  choice  of  instant  decaffeinated  coffee  at  two 
times. 

a.  Fit  the  symmetry  model  and  use  residuals  to  analyze  changes. 

b. Test marginal homogeneity. Show that the small P-value reflects a decrease in
the proportion choosing High Point and an increase in the proportion choosing
Sanka, with no evidence of change for the other coffees.

c. Show that quasi-independence has $G^2 = 13.8$ (df = 11). Interpret, and suggest
other analyses that might be useful.

11.10 For Table 11.6, fit the ordinal quasi-symmetry model using unequally spaced but
sensible scores. Compare results and interpretations to those in Sections 11.3.4 and
11.4.7.


Table 11.16 Data for Exercise 11.9 on Coffee Purchases

                                     Second Purchase
First Purchase      High Point  Taster's Choice  Sanka  Nescafe  Brim
High Point              93            17           44       7     10
Taster's Choice          9            46           11       0      9
Sanka                   17            11          155       9     12
Nescafe                  6             4            9      15      2
Brim                    10             4           12       2     27

Source: Based on data from R. Grover and V. Srinivasan, J. Market. Res. 24: 139-153, 1987. Reprinted with
permission from the American Marketing Association.


Table 11.17 Occupational Mobility Data for Exercise 11.11

                           Son's Status
Father's Status      1      2      3      4      5
1                   50     45      8     18      8
2                   28    174     84    154     55
3                   11     78    110    223     96
4                   14    150    185    714    447
5                    3     42     72    320    411

Source: Reprinted with permission from D. V. Glass (ed.), Social Mobility in Britain,
Glencoe, IL: Free Press, 1954.

11.11 Table 11.17 relates father's and son's occupational status for a British sample.
Analyze these data, using models of (a) symmetry, (b) quasi-symmetry, (c) ordinal
quasi-symmetry, (d) marginal homogeneity, and (e) quasi-independence. Interpret
their fit and lack of fit.

11.12 Each week Variety magazine summarizes reviews of new movies by critics in
several cities. Each review is categorized as pro, con, or mixed, according to
whether the overall evaluation is positive, negative, or a mixture of the two. Table
11.18 summarizes the ratings from April 1995 through September 1996 for Chicago
film critics Gene Siskel and Roger Ebert.

a.  Fit  the  symmetry  model,  quasi-independence  model,  and  quasi-symmetry 
model.  Interpret. 

b.  Test  marginal  homogeneity  using  models,  and  interpret. 

c.  Summarize  these  data  using  the  kappa  measure  of  agreement. 


Table 11.18 Data for Exercise 11.12 on Movie Reviews

                   Ebert
Siskel      Con    Mixed    Pro
Con          24       8      13
Mixed         8      13      11
Pro          10       9      64

Source: A. Agresti and L. Winner, CHANCE 10: 10-14, 1997. Reprinted
with permission, copyright 1997 by the American Statistical Association.

11.13 Table 11.19 displays multiple sclerosis diagnoses for two neurologists who
classified patients in Winnipeg and in New Orleans with the scale (1) certain, (2)
probable, (3) possible, and (4) doubtful, unlikely, or definitely not. For the New
Orleans patients, study the agreement using (a) the independence model and residuals,
(b) more complex models, and (c) kappa. Interpret each.

11.14 For Exercise 11.13, construct a model that describes agreement between
neurologists for the two sites simultaneously.


Table 11.19 Data for Exercise 11.13 on Neurologist Agreement

                              Winnipeg Neurologist
                 Winnipeg Patients         New Orleans Patients
New Orleans
Neurologist      1     2     3     4         1     2     3     4
1               38     5     0     1         5     3     0     0
2               33    11     3     0         3    11     4     0
3               10    14     5     6         2    13     3     4
4                3     7     3    10         1     2     4    14

Source: J. R. Landis and G. G. Koch, Biometrics 33: 159-174, 1977. Reprinted with permission from the Biometric
Society.

11.15 Calculate kappa for the 4 x 4 table having $n_{ii} = 5$ for all $i$, $n_{i,i+1} = 15$, $i = 1, 2, 3$,
$n_{41} = 15$, and $n_{ij} = 0$ otherwise. Use these data to explain why strong association
does not imply strong agreement.
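
The kappa calculation itself is mechanical, and a short sketch may help in checking hand computations. This is illustrative only: `cohen_kappa` is a hypothetical helper implementing the usual unweighted definition (observed agreement minus chance agreement, divided by one minus chance agreement), applied here to the table described in Exercise 11.15.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square table of counts (list of rows)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / n            # observed agreement
    row = [sum(table[i]) / n for i in range(k)]               # row marginals
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # column marginals
    p_exp = sum(r * c for r, c in zip(row, col))              # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)


# Table of Exercise 11.15: n_ii = 5, n_{i,i+1} = 15 (i = 1, 2, 3), n_41 = 15.
table = [[5, 15, 0, 0],
         [0, 5, 15, 0],
         [0, 0, 5, 15],
         [15, 0, 0, 5]]
kappa = cohen_kappa(table)
print(kappa)
```

Every row and column of this table has the same total, so chance agreement depends only on the uniform margins while the large counts sit off the main diagonal.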

11.16  In  1990,  a  sample  of  psychology  graduate  students  at  the  University  of  Florida 
made  blind,  pairwise  preference  tests  of  three  cola  drinks.  For  49  comparisons  of 
Coke  and  Pepsi,  Coke  was  preferred  29  times.  For  47  comparisons  of  Classic  Coke 
and  Pepsi,  Classic  Coke  was  preferred  19  times.  For  50  comparisons  of  Coke  and 
Classic  Coke,  Coke  was  preferred  31  times.  Comparisons  resulting  in  ties  are  not 
reported. 

a.  Fit  the  Bradley-Terry  model,  analyze  the  quality  of  fit,  and  rank  the  drinks.  Is 
there  sufficient  evidence  to  conclude  a  preference  for  one  drink? 

b.  Estimate  the  probability  that  Coke  is  preferred  to  Pepsi,  using  the  model,  and 
compare  to  the  sample  proportion. 
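
One simple way to obtain ML estimates for a Bradley-Terry model without specialized software is a fixed-point iteration of the kind associated with Zermelo (1929). The sketch below is a minimal illustration under stated assumptions, not the book's fitting method: `fit_bradley_terry` and `prefer` are hypothetical names, strengths are on the scale $\exp(\beta_a)$ normalized to sum to 1, and the win counts are those stated in Exercise 11.16.

```python
def fit_bradley_terry(wins, iters=2000, tol=1e-12):
    """Fixed-point (MM) iteration for Bradley-Terry strengths.

    wins[(a, b)] is the number of comparisons in which a was preferred to b.
    Returns a dict of strengths pi_a with P(a preferred to b) = pi_a / (pi_a + pi_b).
    """
    items = sorted({x for pair in wins for x in pair})
    total_wins = {a: sum(w for (x, _), w in wins.items() if x == a) for a in items}

    def n_pair(a, b):
        return wins.get((a, b), 0) + wins.get((b, a), 0)

    pi = {a: 1.0 for a in items}
    for _ in range(iters):
        new = {a: total_wins[a] / sum(n_pair(a, b) / (pi[a] + pi[b])
                                      for b in items if b != a)
               for a in items}
        scale = sum(new.values())
        new = {a: v / scale for a, v in new.items()}
        change = max(abs(new[a] - pi[a]) for a in items)
        pi = new
        if change < tol:
            break
    return pi


# Exercise 11.16: pairwise cola preferences (ties not reported).
cola_wins = {("Coke", "Pepsi"): 29, ("Pepsi", "Coke"): 20,
             ("Classic Coke", "Pepsi"): 19, ("Pepsi", "Classic Coke"): 28,
             ("Coke", "Classic Coke"): 31, ("Classic Coke", "Coke"): 19}
strengths = fit_bradley_terry(cola_wins)


def prefer(a, b):
    """Fitted probability that a is preferred to b."""
    return strengths[a] / (strengths[a] + strengths[b])


print(sorted(strengths, key=strengths.get, reverse=True))
print(prefer("Coke", "Pepsi"))
```

At convergence the fitted expected number of wins for each drink equals its observed number of wins, which is the likelihood-equation property behind Exercise 11.45.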

11.17  Table  11.20  refers  to  journal  citations  among  four  statistics  journals  during 
1987-1989.  The  more  often  articles  in  a  particular  journal  are  cited,  the  more 
prestige  that  journal  accrues.  For  citations  involving  pair  A  and  B,  view  it  as  a 
victory  for  A  if  it  is  cited  by  B  and  a  defeat  for  A  if  it  cites  B.  Fit  the  Bradley-Terry 
model. Interpret the fit, and give a prestige ranking of the journals. For citations
involving Communications in Statistics and JRSS-B, estimate the probability that
the Commun. Stat. article cites the JRSS-B article.


Table 11.20 Data for Exercise 11.17 on Journal Citations

                                  Cited Journal
Citing Journal     Biometrika   Commun. Stat.   JASA   JRSS-B
Biometrika             714            33         320     284
Commun. Stat.          730           425         813     276
JASA                   498            68        1072     325
JRSS-B                 221            17         142     188

Source: Stigler (1994). Reprinted with permission from the Institute of
Mathematical Statistics.


Table 11.21 Data for Exercise 11.18 on 2009-2011 Men's Tennis Results

                                Loser
Winner       Nadal   Federer   Murray   Roddick   Djokovic
Nadal          -        5         6        2          6
Federer        2        -         4        5          7
Murray         3        4         -        2          1
Roddick        1        0         1        -          4
Djokovic       7        7         2        1          -

Source: www.atpworldtour.com.


11.18  Table  11.21  refers  to  matches  for  several  men  tennis  players  between  January  2009 
and  June  2011. 

a.  Fit  the  Bradley-Terry  model.  Interpret,  and  rank  the  players.  Explain  why  it  is 
probably  not  realistic,  however,  to  treat  the  matches  between  a  particular  pair  of 
players  as  all  having  the  same  probabilities.  [Hint:  They  were  played  on  quite 
varied  surfaces,  including  clay  and  grass.] 

b. Estimate the probability of Nadal beating Federer. Compare the model estimate
to the sample proportion. Construct a 90% confidence interval for the probability.

11.19  Refer  to  Exercise  3.7  on  basketball  free-throw  shooting.  Analyze  these  data. 

11.20 Refer to Exercise 9.5. The two-way table relating responses for government
spending on the environment (as rows) and cities (as columns) has cell counts, by row,
(108, 179, 157 / 21, 55, 52 / 5, 6, 24). Analyze these data.

11.21  Analyze  the  data  in  Table  2.13  on  sexual  attitudes  with  methods  presented  in  this 
chapter. 

11.22 Table 11.22 comes from a crossover study in which each subject used each of
three drugs for treatment of a chronic condition at three times. Show that the
sample proportion favorable was (0.61, 0.61, 0.35) for drugs (A, B, C), and that
the likelihood-ratio statistic for testing marginal homogeneity is 5.95 (df = 2), for
a P-value of 0.051. Use the Bonferroni method to find simultaneous 95% Wald
confidence intervals for the differences between proportions for pairs of treatments.


Table 11.22 Crossover Study Results for Exercise 11.22

                        Drug A Favorable              Drug A Unfavorable
                  B Favorable   B Unfavorable    B Favorable   B Unfavorable
C Favorable            6              2               2              6
C Unfavorable         16              4               4              6

Source: Reprinted with permission from the Biometric Society (Grizzle et al. 1969).



[Applying Bonferroni with the score confidence interval gives (-0.015, 0.496)
for drugs A and C and for drugs B and C, and (-0.195, 0.195) for drugs A and B.]

11.23 Refer to Table 9.3. Viewing the table as matched triplets, construct the marginal
distribution for each of marijuana, alcohol, and cigarettes. Test the hypothesis of
marginal homogeneity. Interpret results.

11.24  Refer  to  Table  11.1  and  Exercise  11.2.  Regarding  the  data  for  both  males  and 
females  as  a  2  x  2  x  2  table,  use  loglinear  models  to  describe  the  associations 
between  gender  and  Presidential  vote  in  each  year. 


Theory  and  Methods 

11.25 In genetics, the transmission/disequilibrium test (TDT) considers the transmission
of a variant allele of a biallelic marker from heterozygous parents to affected
children (Dudbridge 2007). It treats the untransmitted allele as a matched control
to the transmitted allele. Show how to express the hypothesis that heterozygous
parents transmit the two alleles with equal probability in the context of a table
with row and column categories (variant, common). Explain why McNemar's test
is appropriate for testing that hypothesis.

11.26  Explain  the  following  analogy:  McNemar's  test  is  to  binary  data  as  the  paired 
difference  t  test  is  to  normally  distributed  data. 

11.27 For a 2 x 2 table, derive $\text{cov}(p_{+1}, p_{1+})$, and show that $\text{var}[\sqrt{n}(p_{+1} - p_{1+})]$ equals
(11.1).


11.28 Consider the subject-specific model (11.8) for binary matched pairs.

a. Show that $\exp(\beta)$ is a conditional odds ratio between observation and outcome.
Explain the distinction between it and the odds ratio $\exp(\beta)$ for model (11.6).

b. Using the conditional distribution (11.9), show that $\hat{\beta} = \log(n_{21}/n_{12})$.

c. Use the delta method to show (11.10) for the SE of $\hat{\beta}$.

d. Averaging over the population, explain why

$$\pi_{21} = E\left[\frac{1}{1 + \exp(\alpha_i)} \cdot \frac{\exp(\alpha_i + \beta)}{1 + \exp(\alpha_i + \beta)}\right],$$

where the expectation is with respect to the distribution for $\{\alpha_i\}$. Similarly,
state $\pi_{12}$. For a random sample of size $n$, explain why as $n \to \infty$,
$n_{21}/n_{12} = p_{21}/p_{12} \xrightarrow{p} \exp(\beta)$. [Hint: Apply the law of large numbers due to
A. A. Markov for independent but not identically distributed random variables,
or use Chebyshev's inequality.]

[Footnote to the score confidence intervals quoted in Exercise 11.22: Thanks to Bernhard Klingenberg for these results.]

e. Show that the Mantel-Haenszel estimator (6.7) of a common odds ratio in
the 2 x 2 x n subject-specific table simplifies to $\exp(\hat{\beta}) = n_{21}/n_{12}$ for the
population-averaged table.

f. Show that the CMH statistic (6.6) for a 2 x 2 x n subject-specific table is
algebraically identical to the McNemar statistic $(n_{21} - n_{12})^2/(n_{21} + n_{12})$ for
the population-averaged table.

11.29 Refer to Exercise 11.28. Unlike the conditional ML estimator of $\beta$, the
unconditional ML estimator is inconsistent (Andersen 1980, pp. 244-245; first shown by
him in 1973). Show this as follows:

a. Assuming independence of responses for different subjects and different
observations by the same subject, find the log likelihood. Show that the likelihood
equations are $y_{+t} = \sum_i \hat{P}(Y_{it} = 1)$ and $y_{i+} = \sum_t \hat{P}(Y_{it} = 1)$.

b. Substituting $\exp(\hat{\alpha}_i)/[1 + \exp(\hat{\alpha}_i)] + \exp(\hat{\alpha}_i + \hat{\beta})/[1 + \exp(\hat{\alpha}_i + \hat{\beta})]$ in the
second likelihood equation, show that $\hat{\alpha}_i = -\infty$ for the $n_{22}$ subjects with
$y_{i+} = 0$, $\hat{\alpha}_i = \infty$ for the $n_{11}$ subjects with $y_{i+} = 2$, and $\hat{\alpha}_i = -\hat{\beta}/2$ for the
$n_{21} + n_{12}$ subjects with $y_{i+} = 1$.

c. By breaking $\sum_i \hat{P}(Y_{it} = 1)$ into components for the sets of subjects having
$y_{i+} = 0$, $y_{i+} = 2$, and $y_{i+} = 1$, show that the first likelihood equation is, for $t = 1$,
$y_{+1} = n_{22}(0) + n_{11}(1) + (n_{21} + n_{12})\exp(-\hat{\beta}/2)/[1 + \exp(-\hat{\beta}/2)]$. Explain
why $y_{+1} = n_{11} + n_{12}$, and solve the first likelihood equation to show that $\hat{\beta} =
2\log(n_{21}/n_{12})$. Hence, as a result of Exercise 11.28, $\hat{\beta} \xrightarrow{p} 2\beta$.

11.30 Consider marginal model (11.6) when $Y_1$ and $Y_2$ are independent and subject-
specific model (11.8) when $\{\alpha_i\}$ are identical. Explain why they are equivalent.

11.31 Let $\hat{\beta}_M = \log(p_{+1}\, p_{2+} / p_{+2}\, p_{1+})$ refer to marginal model (11.6) and $\hat{\beta}_C =
\log(n_{21}/n_{12})$ to subject-specific model (11.8). Using the delta method, show that
the asymptotic variance of $\sqrt{n}(\hat{\beta}_M - \beta_M)$ is

$$(\pi_{1+}\pi_{2+})^{-1} + (\pi_{+1}\pi_{+2})^{-1} - 2(\pi_{11}\pi_{22} - \pi_{12}\pi_{21})/(\pi_{1+}\pi_{2+}\pi_{+1}\pi_{+2}).$$

Under the independence condition of the previous exercise, $\beta_M = \beta_C$. In that case,
show that the asymptotic variances satisfy

$$\text{var}[\sqrt{n}(\hat{\beta}_M)] = (\pi_{1+}\pi_{2+})^{-1} + (\pi_{+1}\pi_{+2})^{-1}
\le (\pi_{1+}\pi_{+2})^{-1} + (\pi_{+1}\pi_{2+})^{-1} = \pi_{12}^{-1} + \pi_{21}^{-1} = \text{var}[\sqrt{n}(\hat{\beta}_C)].$$

11.32 For model (11.12) for a matched-pairs study, with the conditional ML approach
show that the conditional distribution satisfies (11.13) and does not depend on $\beta$
when $S_i = 0$ or 2. Show what happens to $\beta_j$ in the conditional distribution for a
predictor for which $x_{ji1} = x_{ji2}$ for all $i$.

11.33 Consider model (11.12) for a study with matched sets of $T$ observations rather
than matched pairs. Explain how (11.13) generalizes and construct the form of the
conditional likelihood.

11.34 Give an example illustrating that when $I > 2$, marginal homogeneity does not
imply symmetry.

11.35 Derive the likelihood equations and residual df for (a) symmetry, (b)
quasi-symmetry, and (c) quasi-independence.

11.36 For the quasi-symmetry model (11.19), let $\lambda_a = \lambda_a^X - \lambda_a^Y$. Show that the model
can be expressed equivalently as $\log \mu_{ab} = \lambda + \lambda_a + \lambda_{ab}^*$, with $\lambda_{ab}^* = \lambda_{ba}^*$. Hence,
we need only one set of main-effect parameters.

11.37 Show that quasi-symmetry is equivalent (Caussinus 1966) to

$$(\pi_{ab}\,\pi_{bc}\,\pi_{ca})/(\pi_{ba}\,\pi_{cb}\,\pi_{ac}) = 1 \quad \text{for all } a, b, \text{ and } c.$$

11.38  Derive  the  covariance  matrix  V  for  the  difference  vector  d  that  is  estimated  in 
expression  (11.15). 

11.39 Construct the loglinear model satisfying both marginal homogeneity and
statistical independence. Show that $\hat{\pi}_{ab} = (p_{+a} + p_{a+})(p_{+b} + p_{b+})/4$ and residual
df = $I(I-1)$.

11.40 Identify loglinear models that correspond to the logistic models, for $a < b$,
$\log(\pi_{ab}/\pi_{ba}) =$ (a) 0, (b) $\tau$, (c) $\alpha_a - \alpha_b$, and (d) $\beta(b - a)$.

11.41 Consider the multiplicative model for a square table,

$$\pi_{ab} = \begin{cases} \alpha_a \alpha_b (1 - \beta), & a \ne b \\ \alpha_a^2 + \beta \alpha_a (1 - \alpha_a), & a = b. \end{cases}$$

a. Show that the model satisfies (i) symmetry, (ii) marginal homogeneity,
(iii) quasi-symmetry, and (iv) quasi-independence.

b. Show that $\alpha_a = \pi_{a+} = \pi_{+a}$, $a = 1, \ldots, I$.

c. Show that $\beta$ = Cohen's kappa, and interpret $\beta = 0$ and $\beta = 1$ for this model.

11.42 A 2 x 2 table has a true odds ratio of 10. Find the cell probabilities for which (a)
$\pi_{1+} = \pi_{+1} = 0.50$ and (b) $\pi_{1+} = \pi_{+1} = 0.10$. Find kappa for each. (This shows
that for a given association, kappa depends strongly on the marginal probabilities.)

11.43 Consider the Bradley-Terry model (11.30).

a. Show that $\log(\Pi_{ac}/\Pi_{ca}) = \log(\Pi_{ab}/\Pi_{ba}) + \log(\Pi_{bc}/\Pi_{cb})$.

b. With this model, is it possible that $a$ could be preferred to $b$ (i.e., $\Pi_{ab} > \Pi_{ba}$)
and $b$ could be preferred to $c$, yet $c$ could be preferred to $a$? Explain.

c. Explain why $\{\beta_a\}$ are not identifiable without a constraint such as $\beta_I = 0$. [Hint:
Show the model holds when $\{\beta_a^* = \beta_a - c\}$ for any $c$.]

11.44 Refer to model (11.31) for baseball home-team advantage.

a. Construct a more general model having home-team parameters $\{\beta_{Hi}\}$ and
away-team parameters $\{\beta_{Ai}\}$, such that the probability team $i$ beats team $j$ when $i$ is
the home team is $\exp(\beta_{Hi})/[\exp(\beta_{Hi}) + \exp(\beta_{Aj})]$, where $\beta_{AI} = 0$ but $\beta_{HI}$ is
unrestricted.

b. Interpret the case $\{\beta_{Hi} = \beta_{Ai} + c\}$, when (i) $c = 0$ and (ii) $c > 0$.

c. Fit the model to Table 11.12. Compare the fit to model (11.31). Compare $\{\hat{\beta}_{Hi}\}$
and $\{\hat{\beta}_{Ai}\}$ to describe how teams play at home and away.

11.45 Find the log likelihood for the Bradley-Terry model. From the kernel, show that
(given $\{N_{ab}\}$) the minimal sufficient statistics are $\{n_{a+}\}$. Thus, explain how “victory
totals” determine the estimated ranking.

11.46  Explain  how  to  fit  the  complete  symmetry  model  in  T  dimensions. 

11.47 Prove that if $k$th-order marginal symmetry holds, then $j$th-order marginal symmetry
holds for any $j < k$.

11.48 Suppose quasi-symmetry holds for an $I^T$ table. When the table is collapsed over a
variable, show that the model holds for the $I^{T-1}$ table with the same main effects.

11.49 Let $y_{it} = 1$ or 0 for observation $t$ on subject $i$, $i = 1, \ldots, n$, $t = 1, \ldots, T$. Let
$\bar{y}_{.t} = \sum_i y_{it}/n$, $\bar{y}_{i.} = \sum_t y_{it}/T$, and $\bar{y} = \sum_i \sum_t y_{it}/nT$. Regard $\{y_{i+}\}$ as fixed,
and suppose each way to allocate the $y_{i+}$ “successes” to $y_{i+}$ of the observations is
equally likely.

a. Show that $E(Y_{it}) = \bar{y}_{i.}$, $\mathrm{var}(Y_{it}) = \bar{y}_{i.}(1 - \bar{y}_{i.})$, and $\mathrm{cov}(Y_{it}, Y_{ik}) = -\bar{y}_{i.}(1 - \bar{y}_{i.})/(T - 1)$
for $t \neq k$. [Hint: The covariance is the same for any pair of cells in
the same row, and $\mathrm{var}(\sum_t Y_{it}) = 0$ since $y_{i+}$ is fixed.]

b. For large $n$ with independent subjects, explain why $(\bar{Y}_{.1}, \ldots, \bar{Y}_{.T})$ is approximately
multivariate normal with pairwise correlation $\rho = -1/(T - 1)$. Conclude
that Cochran’s Q statistic (Cochran 1950)

$$Q = \frac{n^2 (T - 1) \sum_{t=1}^{T} (\bar{y}_{.t} - \bar{y})^2}{T \sum_{i=1}^{n} \bar{y}_{i.}(1 - \bar{y}_{i.})}$$

is approximately chi-squared with df = $(T - 1)$. [One way notes that if
$(X_1, \ldots, X_T)$ is multivariate normal with common mean and common variance
$\sigma^2$ and common correlation $\rho$ for pairs $(X_t, X_k)$, then $\sum_t (X_t - \bar{X})^2/\sigma^2(1 - \rho)$
is chi-squared with df = $(T - 1)$. Bhapkar and Somes (1977) gave slightly
weaker conditions for a chi-squared limiting distribution for $Q$.]

c. Show that $Q$ is unaffected by deleting cases in which $y_{i1} = \cdots = y_{iT}$.


CHAPTER  12 


Clustered  Categorical  Data: 
Marginal  and  Transitional  Models 


Many  studies  observe  the  response  variable  for  each  subject  repeatedly,  at  several  times  or 
under  various  conditions.  Repeated  categorical  response  data  occur  commonly  in  health- 
related  applications,  especially  in  longitudinal  studies.  For  example,  a  physician  might 
evaluate  patients  taking  a  new  drug  or  a  placebo  at  several  occasions  regarding  whether  the 
treatment  is  successful. 

In  the  next  three  chapters  we  present  models  that  apply  to  data  in  which  repeated 
observations occur for matched sets, or clusters, of observations. In a longitudinal study,
a cluster consists of the set of repeated observations over time by a particular subject.
But the clustered responses need not refer to different times. A dental study might measure
whether there is decay for each tooth in a subject’s mouth. The set of teeth within a subject’s
mouth forms a cluster. A study of factors that affect children’s weight, measured as (normal,
overweight,  obese),  might  sample  families  and  treat  children  from  the  same  family  as  a 
cluster.  A  toxicity  study  may  observe  a  (survival,  nonsurvival)  response  for  each  fetus  in 
a  litter,  for  a  sample  of  pregnant  mice  exposed  to  various  dosages  of  a  toxin.  Each  litter 
forms  a  cluster. 

In  such  applications,  observations  within  a  cluster  tend  to  be  more  alike  than  observations 
from  different  clusters.  Ordinary  analyses  that  ignore  the  correlation  usually  have  invalid 
standard  errors.  Aitkin  et  al.  (1981)  gave  an  example  from  a  project  about  teaching  styles 
where  this  makes  a  substantive  difference,  the  statistical  significance  of  certain  differences 
being  substantially  reduced  after  allowing  for  correlation  among  children  taught  by  the  same 
teacher. In Section 11.1.4 we noted that positive correlation between sample proportions
results  in  improved  precision  for  estimating  within-subject  effects.  By  contrast,  for  inference 
about  between-subject  effects  (such  as  comparing  genders  or  races),  T  repeated  observations 
for  a  single  subject  do  not  provide  as  much  information  as  T  observations  on  different 
subjects,  and  positive  correlations  result  in  larger  standard  errors  for  such  effects  (Exercise 
12.21). 
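
The loss of information for between-subject effects can be made concrete with a small sketch. It assumes, purely for illustration, that the $T$ observations in a cluster share a common variance and a common (exchangeable) pairwise correlation; the function names are hypothetical and not from the text. Under that assumption the variance of the cluster mean is $\sigma^2[1 + (T - 1)\rho]/T$.

```python
def variance_of_cluster_mean(sigma2: float, T: int, rho: float) -> float:
    """Variance of the mean of T observations with common variance sigma2
    and common pairwise correlation rho (an illustrative assumption)."""
    return sigma2 * (1.0 + (T - 1) * rho) / T


def effective_sample_size(T: int, rho: float) -> float:
    """Number of independent observations carrying the same information
    about the mean as T equally correlated observations."""
    return T / (1.0 + (T - 1) * rho)


if __name__ == "__main__":
    # With rho = 0 the T observations behave like T independent ones;
    # positive rho inflates the variance of the cluster mean.
    for rho in (0.0, 0.25, 0.5):
        print(rho, variance_of_cluster_mean(1.0, 3, rho), effective_sample_size(3, rho))
```

With positive correlation the effective sample size falls below $T$, which is why standard errors for between-subject comparisons are larger than an analysis ignoring the clustering would report.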

In this chapter we generalize the marginal model methods of Chapter 11 for matched
pairs to clustered data also having explanatory variables, such as a study that compares the
distribution of repeated observations for different groups or treatments. In Section 12.1 we
introduce marginal models with explanatory variables and fit them using ML methods. In
Section  12.2  we  fit  marginal  models  by  solving  generalized  estimating  equations  (GEEs). 
This  method  is  a  multivariate  version  of  quasi-likelihood  that  is  computationally  simpler 
than  ML.  Section  12.3  presents  technical  details  about  the  GEEs  approach.  In  Section  12.4 
we  introduce  a  transitional  approach  that  models  each  observation  in  terms  of  outcomes  of 
other  observations,  such  as  time  series  models  that  use  past  observations  to  predict  future 
ones. 


12.1  MARGINAL  MODELING:  MAXIMUM  LIKELIHOOD  APPROACH 

Repeated measurement provides a multivariate response $(Y_1, Y_2, \ldots, Y_T)$. Moreover, $T$
often varies by cluster, such as when each cluster is a family or when some observations
in a cluster are missing. In this section we consider marginal models for the $\{Y_t\}$, fitted by
ML. We defer model fitting details to the end of the section.


12.1.1  Example:  Longitudinal  Study  of  Mental  Depression 

Table  12.1  refers  to  a  longitudinal  study  comparing  a  new  drug  with  a  standard  drug  for 
treatment  of  340  subjects  suffering  mental  depression  (Koch  et  al.  1977).  Subjects  were 
classified  into  two  initial  diagnosis  groups  according  to  whether  severity  of  depression  was 
mild  or  severe.  In  each  group,  subjects  were  randomly  assigned  to  one  of  the  two  drugs. 
Following  1  week,  2  weeks,  and  4  weeks  of  treatment,  each  subject’s  suffering  from  mental 
depression  was  classified  as  normal  or  abnormal. 

Table 12.1 shows four groups, the combinations of categories of the two explanatory
variables: treatment type and severity of initial diagnosis. Since the study observed the
binary response (depression assessment) at $T = 3$ occasions, Table 12.1 contains cell
counts for a $2^3$ contingency table for each group. The three depression assessments form a
multivariate response variable with three components. The 12 marginal distributions result
from three repeated observations for each of the four groups.

Let $t$ denote the time of measurement. Denote observation $t$ for a subject by $Y_t = 1$ for
normal and $Y_t = 0$ for abnormal. Let $s$ denote the severity of the initial diagnosis, with
$s = 1$ for severe and $s = 0$ for mild. Let $d$ denote the drug, with $d = 1$ for new and $d = 0$ for
standard. Koch et al. (1977) noted that if the time metric reflects cumulative drug dosage, a
logit scale often has a linear effect for the logarithm of time. They used scores (0, 1, 2) for
$t$, the logs to base 2 of the week numbers (1, 2, and 4).


Table 12.1 Cross-Classification of Responses on Depression at Three Times,
by Diagnosis and Treatment

                               Response at Three Times^a
Diagnosis   Treatment    NNN   NNA   NAN   NAA   ANN   ANA   AAN   AAA
Mild        Standard      16    13     9     3    14     4    15     6
            New drug      31     0     6     0    22     2     9     0
Severe      Standard       2     2     8     9     9    15    27    28
            New drug       7     2     5     2    31     5    32     6

^a N, normal; A, abnormal.

Source: Reprinted with permission from the Biometric Society (Koch et al. 1977).


Table 12.2 Sample Marginal Proportions of Normal Response
for Depression Data of Table 12.1

                               Sample Proportion
Diagnosis   Treatment    Week 1   Week 2   Week 4
Mild        Standard      0.51     0.59     0.68
            New drug      0.53     0.79     0.97
Severe      Standard      0.21     0.28     0.46
            New drug      0.18     0.50     0.83


Table 12.2 shows sample proportions of normal responses for the 12 marginal distributions.
For instance, from Table 12.1, the sample proportion of normal responses after week
1 for subjects with mild initial severity using the standard drug was

(16 + 13 + 9 + 3)/(16 + 13 + 9 + 3 + 14 + 4 + 15 + 6) = 0.51.
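
The same calculation can be repeated for all 12 marginal distributions. The following is a minimal sketch that sums the Table 12.1 cell counts over the response patterns having N at a given occasion; the variable and function names are illustrative choices, not part of the text.

```python
# Response patterns in the column order of Table 12.1 (N = normal, A = abnormal).
PATTERNS = ["NNN", "NNA", "NAN", "NAA", "ANN", "ANA", "AAN", "AAA"]

# Cell counts from Table 12.1, keyed by (diagnosis, treatment).
COUNTS = {
    ("Mild", "Standard"): [16, 13, 9, 3, 14, 4, 15, 6],
    ("Mild", "New drug"): [31, 0, 6, 0, 22, 2, 9, 0],
    ("Severe", "Standard"): [2, 2, 8, 9, 9, 15, 27, 28],
    ("Severe", "New drug"): [7, 2, 5, 2, 31, 5, 32, 6],
}


def marginal_proportions(counts):
    """Proportion of normal responses at each of the three occasions."""
    total = sum(counts)
    return [
        sum(c for pattern, c in zip(PATTERNS, counts) if pattern[t] == "N") / total
        for t in range(3)
    ]


MARGINALS = {group: marginal_proportions(c) for group, c in COUNTS.items()}

if __name__ == "__main__":
    for group, props in MARGINALS.items():
        print(group, ", ".join(f"{p:.3f}" for p in props))
```

Up to rounding, the values agree with Table 12.2, and the four group sizes (80, 70, 100, 90) sum to the 340 subjects in the study.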

The  sample  proportion  of  normal  responses  (1)  increased  over  time  for  each  group;  (2) 
increased  at  a  faster  rate  for  the  new  drug  than  the  standard,  for  each  fixed  severity;  and  (3) 
was  higher  for  the  mild  than  the  severe  initial  severity  diagnosis,  for  each  treatment  at  each 
occasion.  In  such  a  study  the  company  that  developed  the  new  drug  would  hope  to  show 
that  patients  have  a  significantly  higher  rate  of  improvement  with  that  drug. 

The  marginal  logistic  model 

$$\mathrm{logit}[P(Y_t = 1)] = \alpha + \beta_1 s + \beta_2 d + \beta_3 t$$

has main effects for the explanatory variables (severity and drug) and the variable (time)
that specifies the different components of the multivariate response. Its linear time effect
$\beta_3$ is the same for each group.

The natural sampling assumption is multinomial for the eight cells in the $2^3$ cross-classification
of the three responses, independently for the four groups. However, the model
refers to 12 marginal probabilities (for 2 drug treatments x 2 severity diagnoses x 3 time
points) rather than the $4 \times 2^3 = 32$ cell probabilities in the product multinomial likelihood
function.  The  three  marginal  binomial  variates  for  each  group  are  dependent.  ML  estimation 
requires  an  iterative  routine  for  maximizing  the  product  multinomial  likelihood,  subject  to 
the  constraint  that  the  marginal  probabilities  satisfy  the  model.  An  algorithm  for  this  is 
described  in  Section  12.1.4. 

A check of model fit compares the 32 cell counts in Table 12.1 to their ML fitted values.
Since the model describes 12 marginal logits using four parameters, residual df = 8. The
deviance $G^2 = 34.6$. The poor fit is not surprising. The model assumes a common rate of
improvement $\beta_3$ over time, but Table 12.2 suggests a faster rate for the new drug.

A  more  realistic  model  permits  the  time  effect  to  differ  by  drug, 


$$\mathrm{logit}[P(Y_t = 1)] = \alpha + \beta_1 s + \beta_2 d + \beta_3 t + \beta_4 (d \times t).$$

Its ML time effect estimates are $\hat{\beta}_3 = 0.48$ (SE = 0.12) for the standard drug ($d = 0$)
and $\hat{\beta}_3 + \hat{\beta}_4 = 1.49$ (SE = 0.14) for the new one ($d = 1$). For the new drug, the slope
is $\hat{\beta}_4 = 1.01$ (SE = 0.18) higher than for the standard, giving strong evidence of faster
improvement. This model fits much better, with $G^2 = 4.2$ (df = 7). The $G^2$ decrease of
34.6 − 4.2 = 30.4 compared to the simpler model is the likelihood-ratio test of $H_0$: $\beta_4 = 0$,
a common time effect for each drug.
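
As a rough numerical check of this comparison, the sketch below forms the likelihood-ratio statistic and its df from the two reported deviances and evaluates the chi-squared tail probability. It relies on the identity that a chi-squared variate with df = 1 is a squared standard normal, so the tail probability can be written with the complementary error function; the helper name is hypothetical.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability P(X > x) for chi-squared with df = 1,
    using P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


# Deviances and residual df reported in the text for the two marginal models.
G2_MAIN_EFFECTS, DF_MAIN_EFFECTS = 34.6, 8
G2_INTERACTION, DF_INTERACTION = 4.2, 7

LR_STAT = G2_MAIN_EFFECTS - G2_INTERACTION
LR_DF = DF_MAIN_EFFECTS - DF_INTERACTION
P_VALUE = chi2_sf_df1(LR_STAT)

if __name__ == "__main__":
    print(f"LR statistic = {LR_STAT:.1f}, df = {LR_DF}, P-value = {P_VALUE:.2e}")
```

The tiny tail probability is consistent with the text's conclusion of strong evidence that the time effect differs by drug.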

The estimate of the severity effect is $\hat{\beta}_1 = -1.29$ (SE = 0.14). For each drug x time
combination, the estimated odds of a normal response when the initial diagnosis was severe
equal exp(−1.29) = 0.27 times the estimated odds when the initial diagnosis was mild. The
estimate $\hat{\beta}_2 = -0.06$ (SE = 0.22) indicates an insignificant difference between the drugs
after one week (for which $t = 0$). At time $t$, the estimated odds of normal response with the
new drug are $\exp(-0.06 + 1.01t)$ times the estimated odds for the standard drug, for each
severity  level.  In  summary,  severity,  drug  treatment,  and  time  all  have  substantial  effects  on 
the  probability  of  a  normal  response. 
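
To see how the drug effect grows over time, the estimated odds ratios can be evaluated at the three time scores. This is a small sketch using only the rounded estimates quoted above, so the results carry rounding error; the names are illustrative.

```python
import math

# Rounded ML estimates quoted in the text.
BETA_SEVERITY = -1.29      # severity effect
BETA_DRUG = -0.06          # drug effect at t = 0 (week 1)
BETA_DRUG_BY_TIME = 1.01   # extra time slope for the new drug

SEVERITY_ODDS_RATIO = math.exp(BETA_SEVERITY)


def drug_odds_ratio(t: float) -> float:
    """Estimated odds of a normal response, new drug versus standard, at time score t."""
    return math.exp(BETA_DRUG + BETA_DRUG_BY_TIME * t)


if __name__ == "__main__":
    print(f"severe vs mild: {SEVERITY_ODDS_RATIO:.2f}")
    for t, week in zip((0, 1, 2), (1, 2, 4)):
        print(f"week {week} (t = {t}): new vs standard odds ratio = {drug_odds_ratio(t):.2f}")
```

The odds ratio is below 1 at week 1, where the drugs are not distinguishable, and increases by a factor of exp(1.01) for each doubling of time.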


12.1.2  Modeling  a  Repeated  Multinomial  Response 

Models for marginal distributions of a repeated binary response generalize to multicategory
($I > 2$) responses. At observation $t$, the marginal response distribution has $(I - 1)$ logits.
For a particular marginal logit, a model has the form

$$\mathrm{logit}_j(t) = \alpha_j + \boldsymbol{\beta}_j' \mathbf{x}_t, \quad j = 1, \ldots, I - 1, \quad t = 1, \ldots, T.$$

For a nominal response, we can use a baseline-category logit, $\mathrm{logit}_j(t) = \log[P(Y_t = j)/P(Y_t = I)]$,
to describe the odds of each outcome relative to a baseline. For ordinal responses,
we can use the cumulative logit, $\mathrm{logit}_j(t) = \mathrm{logit}[P(Y_t \le j)]$. With $\boldsymbol{\beta}_j = \boldsymbol{\beta}$ for
all $j$, the model takes the proportional odds form with the same effects for each logit.
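
The two kinds of logit can be computed directly from a marginal distribution. The sketch below does this for a hypothetical four-category distribution (the probabilities are made up for illustration and the function names are not from the text).

```python
import math


def baseline_category_logits(probs):
    """log[P(Y = j) / P(Y = I)] for j = 1, ..., I - 1, with the last category as baseline."""
    baseline = probs[-1]
    return [math.log(p / baseline) for p in probs[:-1]]


def cumulative_logits(probs):
    """logit[P(Y <= j)] for j = 1, ..., I - 1."""
    logits, cumulative = [], 0.0
    for p in probs[:-1]:
        cumulative += p
        logits.append(math.log(cumulative / (1.0 - cumulative)))
    return logits


# Hypothetical marginal distribution over I = 4 ordered categories.
EXAMPLE = [0.10, 0.20, 0.30, 0.40]

if __name__ == "__main__":
    print("baseline-category logits:", baseline_category_logits(EXAMPLE))
    print("cumulative logits:       ", cumulative_logits(EXAMPLE))
```

Either function returns $I - 1$ logits, matching the count stated above; the cumulative logits are necessarily increasing in $j$, whereas the baseline-category logits need not be.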

12.1.3  Example:  Insomnia  Clinical  Trial 

Table  12.3  shows  results  of  a  randomized,  double-blind  clinical  trial  comparing  an  active 
hypnotic  drug  with  a  placebo  in  239  patients  who  have  insomnia  problems.  The  response 
is the patient’s reported time to fall asleep after going to bed, grouped into intervals of
minutes.  Patients  responded  before  and  following  a  two-week  treatment  period.  The  two 
treatments,  active  and  placebo,  form  a  binary  explanatory  variable.  The  subjects  receiving 
the  two  treatments  were  independent  samples. 

Table  12.4  displays  sample  marginal  distributions  for  the  four  treatment  x  occasion 
combinations.  From  the  initial  to  follow-up  occasion,  time  to  falling  asleep  seems  to  shift 
downward  for  both  treatments.  The  degree  of  shift  seems  greater  for  the  active  treatment, 
indicating  possible  interaction.  The  response  is  a  discrete  version  of  a  continuous  variable, 
so  by  the  derivation  in  Section  8.2.3  a  cumulative  link  model  is  natural.  The  proportional 
odds  version  of  the  cumulative  logit  model, 

$$\mathrm{logit}[P(Y_t \le j)] = \alpha_j + \beta_1 t + \beta_2 x + \beta_3 (t \times x), \qquad (12.1)$$

permits  interaction  between  t  =  occasion  (0  =  initial,  1  =  follow-up)  and  x  =  treatment 
(0  =  placebo,  1  =  active),  but  assumes  the  same  effects  for  each  response  cutpoint. 

Table 12.3 Time to Falling Asleep, by Treatment and Occasion

                         Time to Falling Asleep: Follow-up
Treatment   Initial      <20    20-30    30-60    >60
Active      <20            7       4        1       0
            20-30         11       5        2       2
            30-60         13      23        3       1
            >60            9      17       13       8
Placebo     <20            7       4        2       1
            20-30         14       5        1       0
            30-60          6       9       18       2
            >60            4      11       14      22

Source:  From  S.  F.  Francom,  C.  Chuang-Stein,  and  J.  R.  Landis,  Statist.  Med.  8: 
571-582,  1989.  Reprinted  with  permission  from  John  Wiley  &  Sons  Ltd. 


For ML model fitting, $G^2 = 8.0$ (df = 6) for comparing observed to fitted cell counts
in modeling the 12 marginal logits using these six parameters. The ML estimates are
$\hat{\beta}_1 = 1.074$ (SE = 0.162), $\hat{\beta}_2 = 0.046$ (SE = 0.236), and $\hat{\beta}_3 = 0.662$ (SE = 0.244). This
shows  evidence  of  interaction.  At  the  initial  observation,  the  estimated  odds  that  time  to 
falling  asleep  for  the  active  treatment  is  below  any  fixed  level  equal  exp(0.046)  =  1.04 
times  the  estimated  odds  for  the  placebo  treatment;  at  the  follow-up  observation,  the 
effect  is  exp(0.046  +  0.662)  =  2.03.  In  other  words,  initially  the  two  groups  had  similar 
distributions,  as  expected  by  the  randomization  of  subjects  to  treatment  groups,  but  at  the 
follow-up  those  with  the  active  treatment  tended  to  fall  asleep  more  quickly. 

For  simpler  interpretation,  it  can  be  helpful  to  report  sample  marginal  means  and  their 
differences. With response scores {10, 25, 45, 75} for time to fall asleep, the initial means
were  50.0  for  the  active  group  and  50.3  for  the  placebo.  The  difference  in  means  between 
the  initial  and  follow-up  responses  was  22.2  for  the  active  group  and  13.0  for  the  placebo. 
The  difference  between  these  differences  of  means  equals  9.2,  with  SE  =  3.0,  indicating 
that  the  change  was  significantly  greater  for  the  active  group. 
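
The marginal means quoted here follow from the row and column totals of Table 12.3. A minimal sketch of that arithmetic is below; the names are illustrative, and the standard error of 3.0 is not reproduced because it also depends on the within-subject covariance.

```python
# Response scores for the four categories of time to falling asleep.
SCORES = [10, 25, 45, 75]

# Table 12.3: rows index the initial response, columns the follow-up response.
TABLE_12_3 = {
    "Active": [[7, 4, 1, 0], [11, 5, 2, 2], [13, 23, 3, 1], [9, 17, 13, 8]],
    "Placebo": [[7, 4, 2, 1], [14, 5, 1, 0], [6, 9, 18, 2], [4, 11, 14, 22]],
}


def marginal_counts(table):
    """Return (initial, follow-up) marginal counts: row totals and column totals."""
    initial = [sum(row) for row in table]
    follow_up = [sum(col) for col in zip(*table)]
    return initial, follow_up


def mean_score(counts):
    """Mean of the response scores weighted by the marginal counts."""
    return sum(s * c for s, c in zip(SCORES, counts)) / sum(counts)


MEANS = {}
for treatment, table in TABLE_12_3.items():
    initial, follow_up = marginal_counts(table)
    MEANS[treatment] = (mean_score(initial), mean_score(follow_up))

CHANGES = {trt: init - follow for trt, (init, follow) in MEANS.items()}
DIFFERENCE_OF_CHANGES = CHANGES["Active"] - CHANGES["Placebo"]

if __name__ == "__main__":
    for trt in TABLE_12_3:
        print(trt, [round(m, 1) for m in MEANS[trt]], round(CHANGES[trt], 1))
    print("difference of changes:", round(DIFFERENCE_OF_CHANGES, 1))
```

Dividing the marginal counts by the group sizes (119 and 120) also gives the sample marginal distributions reported in Table 12.4.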

12.1.4  ML  Fitting  of  Marginal  Logistic  Models:  Constraints  on  Cell  Probabilities 

Table 12.4 Sample Marginal Distributions for Insomnia Data of Table 12.3

                              Time to Falling Asleep
Treatment   Occasion      <20     20-30    30-60    >60
Active      Initial      0.101    0.168    0.336    0.395
            Follow-up    0.336    0.412    0.160    0.092
Placebo     Initial      0.117    0.167    0.292    0.425
            Follow-up    0.258    0.242    0.292    0.208

ML fitting of marginal logistic models is awkward. For $T$ observations on an $I$-category
response, at each setting of predictors the likelihood refers to $I^T$ multinomial joint
probabilities, but the model applies to $T$ sets of marginal multinomial parameters
$\{P(Y_t = k), k = 1, \ldots, I\}$.

Let $\boldsymbol{\pi}$ denote the complete set of multinomial joint probabilities for all settings of
predictors. Marginal logistic models have the generalized loglinear model form

$$\mathbf{C} \log(\mathbf{A}\boldsymbol{\pi}) = \mathbf{X}\boldsymbol{\beta} \qquad (12.2)$$

introduced in Section 10.5.1. In the binary case, the matrix $\mathbf{A}$ applied to $\boldsymbol{\pi}$ forms the $T$
marginal probabilities $\{P(Y_t = 1)\}$ and their complements at each setting of predictors.
The matrix $\mathbf{C}$ applied to the log marginal probabilities forms the $T$ marginal logits for each
setting; each row of $\mathbf{C}$ has 1 in the position multiplied by the log numerator probability for
a given marginal logit, −1 in the position multiplied by the log denominator probability,
and 0 elsewhere.

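For instance, for the model of marginal homogeneity in a $2^T$ table with no covariates,

$$\mathrm{logit}[P(Y_t = 1)] = \alpha, \quad t = 1, \ldots, T,$$

$\boldsymbol{\beta}$ is a single parameter, denoted by $\alpha$ here. For $T = 2$, $\boldsymbol{\pi}$ has four elements, and this
model is

$$\begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \log \left[ \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} \pi_{11} \\ \pi_{12} \\ \pi_{21} \\ \pi_{22} \end{pmatrix} \right] = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \alpha,$$

setting both $\mathrm{logit}(\pi_{11} + \pi_{12}) = \mathrm{logit}[P(Y_1 = 1)]$ and $\mathrm{logit}(\pi_{11} + \pi_{21}) = \mathrm{logit}[P(Y_2 = 1)]$
equal to $\alpha$.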

The likelihood function $\ell(\boldsymbol{\pi})$ for a marginal logistic model is the product of the multinomial
mass functions from the various predictor settings. One approach for ML fitting
views the model as a set of constraints and uses methods for maximizing a function subject
to constraints. In model (12.2), let $\mathbf{U}$ denote a full column rank matrix such that the space
spanned by the columns of $\mathbf{U}$ is the orthogonal complement of the space spanned by the
columns of $\mathbf{X}$. Then, $\mathbf{U}^T \mathbf{X} = \mathbf{0}$, and the model has the equivalent constraint form

$$\mathbf{U}^T \mathbf{C} \log(\mathbf{A}\boldsymbol{\pi}) = \mathbf{0}.$$

For instance, for marginal homogeneity in a 2 x 2 table with (12.2) as expressed above,
$\mathbf{U}^T = (1, -1)$. Then, $\mathbf{U}^T$ applied to $\mathbf{C} \log(\mathbf{A}\boldsymbol{\pi})$ sets the difference between the row and
column marginal logits equal to 0.
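
The matrix form for the 2 x 2 case can be checked numerically. The sketch below hard-codes the matrices for $T = 2$ described above and evaluates the marginal logits and the constraint function for two hypothetical probability vectors (one satisfying marginal homogeneity, one not); plain Python lists are used, and the helper names are illustrative.

```python
import math

# Matrices for a 2 x 2 table, with pi ordered as (pi11, pi12, pi21, pi22).
A = [[1, 1, 0, 0],   # pi_{1+} = P(Y1 = 1)
     [0, 0, 1, 1],   # pi_{2+}
     [1, 0, 1, 0],   # pi_{+1} = P(Y2 = 1)
     [0, 1, 0, 1]]   # pi_{+2}
C = [[1, -1, 0, 0],  # row marginal logit
     [0, 0, 1, -1]]  # column marginal logit
U_T = [[1, -1]]      # contrast orthogonal to X = (1, 1)^T


def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def marginal_logits(pi):
    """C log(A pi): the two marginal logits."""
    return matvec(C, [math.log(p) for p in matvec(A, pi)])


def constraint(pi):
    """U^T C log(A pi): zero exactly when marginal homogeneity holds."""
    return matvec(U_T, marginal_logits(pi))[0]


PI_HOMOGENEOUS = [0.4, 0.1, 0.1, 0.4]   # hypothetical, pi12 = pi21
PI_HETEROGENEOUS = [0.4, 0.2, 0.1, 0.3]  # hypothetical, pi12 != pi21

if __name__ == "__main__":
    print(marginal_logits(PI_HOMOGENEOUS), constraint(PI_HOMOGENEOUS))
    print(marginal_logits(PI_HETEROGENEOUS), constraint(PI_HETEROGENEOUS))
```

For larger $T$ or multicategory responses the same construction applies, but $\mathbf{A}$ and $\mathbf{C}$ grow quickly, which is part of why this approach becomes impractical.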

This method of maximizing the likelihood incorporates these model constraints as well
as identifiability constraints, which constrain the response probabilities at each predictor
setting to sum to 1. We express this collection of model constraints $\mathbf{U}^T \mathbf{C} \log(\mathbf{A}\boldsymbol{\pi}) = \mathbf{0}$
and identifiability constraints as $\mathbf{f}(\boldsymbol{\pi}) = \mathbf{0}$. The method introduces Lagrange multipliers
corresponding to these constraints and solves the Lagrangian likelihood equations using a
Newton-Raphson algorithm (Aitchison and Silvey 1958, Haber 1985). Let $\boldsymbol{\theta}$ be a vector
having elements $\boldsymbol{\pi}$ and the Lagrange multipliers $\boldsymbol{\lambda}$. The Lagrangian likelihood equations
have form $\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}$, where

$$\mathbf{h}(\boldsymbol{\theta}) = \mathbf{h}(\boldsymbol{\pi}, \boldsymbol{\lambda}) = \left( \mathbf{f}(\boldsymbol{\pi}), \; \partial \log[\ell(\boldsymbol{\pi})]/\partial \boldsymbol{\pi} + [\partial \mathbf{f}(\boldsymbol{\pi})/\partial \boldsymbol{\pi}]^T \boldsymbol{\lambda} \right)^T$$

is a vector with terms involving the contrasts in marginal logits that the model specifies as
constraints as well as log-likelihood derivatives.

The Newton-Raphson method then applies as

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \left[ \frac{\partial \mathbf{h}(\boldsymbol{\theta}^{(t)})}{\partial \boldsymbol{\theta}} \right]^{-1} \mathbf{h}(\boldsymbol{\theta}^{(t)}), \quad t = 1, 2, \ldots.$$

This can be computationally intensive because the derivative matrix inverted has dimensions
larger than the number of elements in $\boldsymbol{\pi}$. A refinement (Lang 1996a, Lang and Agresti 1994)
uses an asymptotic approximation to a reparameterized derivative matrix that has a much
simpler form, requiring inverting only a diagonal matrix and a symmetric positive definite
matrix.
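
For a feel of how the Lagrangian Newton-Raphson iteration behaves, here is a minimal sketch for marginal homogeneity in a single 2 x 2 table. It makes several simplifying assumptions that are not from the text: the counts are hypothetical, the model constraint is written in the equivalent linear form $\pi_{12} - \pi_{21} = 0$ rather than as a contrast of marginal logits, the full derivative matrix is solved by Gaussian elimination (not the Lang refinement), and all function names are illustrative.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(rhs[i])] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def h(theta, n):
    """Constraints f(pi) followed by the Lagrangian score for each pi."""
    p11, p12, p21, p22, lam1, lam2 = theta
    n11, n12, n21, n22 = n
    return [
        p12 - p21,                     # model constraint (marginal homogeneity)
        p11 + p12 + p21 + p22 - 1.0,   # identifiability constraint
        n11 / p11 + lam2,
        n12 / p12 + lam1 + lam2,
        n21 / p21 - lam1 + lam2,
        n22 / p22 + lam2,
    ]


def jacobian(theta, n):
    """Derivative matrix of h with respect to theta."""
    p11, p12, p21, p22, _, _ = theta
    n11, n12, n21, n22 = n
    return [
        [0, 1, -1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0],
        [-n11 / p11 ** 2, 0, 0, 0, 0, 1],
        [0, -n12 / p12 ** 2, 0, 0, 1, 1],
        [0, 0, -n21 / p21 ** 2, 0, -1, 1],
        [0, 0, 0, -n22 / p22 ** 2, 0, 1],
    ]


def fit_marginal_homogeneity(n, iterations=25):
    """Newton-Raphson on h(theta) = 0, started at the sample proportions."""
    total = sum(n)
    theta = [c / total for c in n] + [0.0, -float(total)]
    for _ in range(iterations):
        step = solve(jacobian(theta, n), h(theta, n))
        theta = [value - s for value, s in zip(theta, step)]
    return theta


COUNTS = (794, 150, 86, 570)  # hypothetical 2 x 2 counts (n11, n12, n21, n22)
FIT = fit_marginal_homogeneity(COUNTS)

if __name__ == "__main__":
    print("fitted probabilities:", [round(p, 5) for p in FIT[:4]])
```

In this special case the constrained ML solution is known in closed form (the off-diagonal cells are averaged), which gives a convenient check on the iteration. Even here the system has six unknowns for four cell probabilities, illustrating the remark that the matrix to be inverted is larger than $\boldsymbol{\pi}$ itself.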

This ML marginal fitting method is available in specialized software. (The computing
Appendix at the text website describes an R function, mphfit, developed by J. Lang.) The
method makes no assumption about the model that describes the joint distribution $\boldsymbol{\pi}$. Thus,
when the marginal model holds, the ML estimate of $\boldsymbol{\beta}$ in (12.2) is consistent regardless of
the dependence structure for that distribution.


12.1.5  ML  Fitting  of  Marginal  Logistic  Models:  Other  Methods 

Alternative approaches have been proposed for ML fitting of marginal models. Lang and
Agresti (1994) showed how to simultaneously fit a marginal model and an unsaturated
loglinear model for $\boldsymbol{\pi}$. For example, with binary matched pairs we can specify marginal
logistic models for $Y_1$ in terms of $x$ and for $Y_2$ in terms of $x$, and simultaneously model the
log odds ratio between $Y_1$ and $Y_2$ in terms of $x$. The complete model can be specified as
a  special  case  of  (12.2)  and  fitted  using  the  constraint  approach  with  Lagrange  multipliers 
just  described.  In  standard  cases,  the  marginal  and  joint  model  parameters  are  orthogonal.  If 
the  marginal  model  holds,  the  ML  estimator  of  the  marginal  model  parameters  is  consistent 
even  if  the  model  for  the  joint  distribution  is  incorrect. 

Fitzmaurice and Laird (1993) proposed a related ML approach. A one-to-one correspondence
holds between $\boldsymbol{\pi}$ and parameters of the saturated loglinear model. They used a
further one-to-one correspondence between the main effect and the higher-order parameters
of that loglinear model with the marginal probabilities and those same higher-order loglinear
parameters. Models were then specified separately for the marginal probabilities and
the  higher-order  (conditional)  loglinear  parameters.  The  likelihood  is  then  maximized  in 
terms  of  the  two  sets  of  model  parameters.  Again,  the  two  sets  of  parameters  are  orthogonal, 
so  the  ML  estimator  of  marginal  model  parameters  is  consistent  when  the  marginal  model 
holds.  This  mixed  parameter  approach  is  also  available  in  specialized  software  (Kastner 
et  al.  1997). 

Yet another ML approach uses a one-to-one correspondence between $\boldsymbol{\pi}$ and parameters
that describe the marginal distributions, the bivariate distributions, the trivariate distributions,
and so on (e.g., Glonek and McCullagh 1995, Molenberghs and Lesaffre 1994,
1999). Multivariate logistic models then apply to the component distributions, although for
simplicity some higher-order effects may be assumed to vanish. Glonek (1996) proposed a
hybrid of this and the Fitzmaurice and Laird (1993) approach.

Finally, pseudo-likelihood approaches replace the likelihood function by a much simpler
one, such as with the composite likelihood approach that uses contributions to the likelihood
function for all pairs of observations. For an overview, see Varin et al. (2011).


12.2  MARGINAL  MODELING:  GENERALIZED  ESTIMATING 
EQUATIONS  (GEEs)  APPROACH 

For ML fitting of marginal models, at each combination of predictor values we assume a
multinomial distribution over $I^T$ cell probabilities for the $T$ observations on an $I$-category
response.  As  T  increases,  the  number  of  multinomial  probabilities  increases  dramatically. 
Currently,  the  ML  fitting  approaches  described  in  the  previous  section  are  not  practical 
when  T  is  large  or  there  are  many  predictors,  especially  when  at  least  one  is  continuous. 

An alternative to ML fitting uses a multivariate generalization of quasi-likelihood (Section 4.7).
Recall that the (univariate) quasi-likelihood method, rather than assuming a
particular distribution for $Y$, specifies only the first two moments; it links the mean to a
linear  predictor  and  also  specifies  how  the  variance  depends  on  the  mean.  The  estimates  are 
solutions  of  estimating  equations  that  are  likelihood  equations  under  the  further  assumption 
of  a  distribution  in  the  exponential  family  with  that  mean  and  variance. 

12.2.1  Generalized  Estimating  Equations  Methodology:  Basic  Ideas 

As in the univariate case, the quasi-likelihood method specifies a model for $\mu_t = E(Y_t)$ and
specifies a variance function $v(\mu_t)$ that describes how $\mathrm{var}(Y_t)$ depends on $\mu_t$. Now, though,
that model applies to the marginal distribution for each $Y_t$. The method also requires a
working guess for the correlation structure among $\{Y_t\}$. The estimates are solutions of
equations  called  generalized  estimating  equations.  The  method  is  often  referred  to  as  the 
GEE  method.  Liang  and  Zeger  (1986)  proposed  it  for  marginal  modeling  with  GLMs.  We 
outline  concepts  here  and  give  more  technical  details  in  Section  12.3. 

In the GEE approach, we specify a variance function and a pairwise “working correlation”
pattern for $(Y_1, Y_2, \ldots, Y_T)$, but we do not need to assume a particular multivariate
distribution. A popular working correlation structure is the exchangeable one that treats
$\mathrm{corr}(Y_s, Y_t)$ as identical for all $s$ and $t$. For repeated measurement over time, also popular
is the autoregressive structure, $\mathrm{corr}(Y_s, Y_t) = \rho^{|s - t|}$, which treats observations farther apart
in time as more weakly correlated. More generally, an unstructured working correlation
permits a separate correlation for each pair. In the other direction, a simple independence
structure treats $\{Y_t\}$ as pairwise independent. However, the correlation is not the ideal parameter
for describing association with categorical variables. Section 12.3.5 presents an
adaptation of the GEE method based on choosing a working structure for the odds ratios to
determine the relevant covariance matrix used in the generalized estimating equations.
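
The working correlation patterns named above are easy to write out as matrices. The following sketch builds the independence, exchangeable, and autoregressive structures for a given $T$ and counts the correlation parameters each requires; the function names and the value of $\rho$ are illustrative assumptions.

```python
def independence(T):
    """Identity working correlation: observations treated as uncorrelated."""
    return [[1.0 if s == t else 0.0 for t in range(T)] for s in range(T)]


def exchangeable(T, rho):
    """Common correlation rho for every pair of distinct observations."""
    return [[1.0 if s == t else rho for t in range(T)] for s in range(T)]


def autoregressive(T, rho):
    """corr(Y_s, Y_t) = rho ** |s - t|, weaker for observations farther apart."""
    return [[rho ** abs(s - t) for t in range(T)] for s in range(T)]


def correlation_parameter_count(structure, T):
    """Number of correlation parameters each working structure must estimate."""
    return {
        "independence": 0,
        "exchangeable": 1,
        "autoregressive": 1,
        "unstructured": T * (T - 1) // 2,
    }[structure]


if __name__ == "__main__":
    for row in autoregressive(4, 0.5):
        print(row)
    print(correlation_parameter_count("unstructured", 10))
```

The parameter count for the unstructured case grows quadratically in $T$, which is the source of the efficiency concern with that choice when $T$ is large.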

The choice for the working correlation determines the GEE estimates of model parameters
$\boldsymbol{\beta}$ describing effects of explanatory variables on $E(Y_t)$ and their model-based standard
errors. For example, under the independence structure, the estimates are identical to the ML
estimates obtained by treating all observations within and between clusters as independent.
Usually, little a priori information is available about the correlation structure, and it is
regarded as a nuisance. The GEE estimates of model parameters are valid, however, even
if we misspecify the covariance structure. Specifically, suppose that the model is correct in
the sense that the chosen link function and linear predictor truly describe how $E(Y_t)$ depends
on the explanatory variables, $t = 1, \ldots, T$. Then the GEE model parameter estimators are
consistent (i.e., the estimators converge in probability to the true parameters). In practice, a
chosen model is never exactly correct. This consistency result is useful, however, for suggesting
that the correlation structure need not adversely affect this aspect of the estimates,
for whatever model we use.

Although  the  model  parameter  estimates  are  usually  fine  whatever  working  correlation 
assumption  we  choose,  their  model-based  standard  errors  are  not.  More  appropriate  standard 
errors  result  from  an  adjustment  the  GEE  method  can  make  using  the  empirical  dependence 
the  data  exhibit.  The  standard  errors  based  on  the  working  correlation  assumption  are 
updated using the empirical dependence to yield more appropriate (robust) standard errors.
So,  even  if  we  select  a  seemingly  inappropriate  working  correlation  structure  such  as 
independence,  the  empirical  standard  errors  produced  by  the  GEE  method  reflect  the 
sample  dependence. 

Choosing a working correlation structure that well approximates the true correlations
can pay benefits regarding efficiency of estimation of $\boldsymbol{\beta}$. It may seem safest to use the
unstructured  correlation  structure.  When  T  is  large,  however,  this  approach  can  suffer 
some  loss  of  efficiency  because  of  the  large  number  of  correlation  parameters  that  need 
to  be  estimated.  Liang  and  Zeger  (1986)  noted  that  estimators  based  on  independence 
working  correlation  can  have  surprisingly  good  efficiency  when  the  actual  correlation  is 
weak  to  moderate.  However,  Fitzmaurice  (1995)  showed  that  efficiency  can  suffer  for 
estimating  the  effect  of  an  explanatory  variable  that  varies  within  each  cluster,  especially 
when correlations between responses are moderately strong. To check the sensitivity to the
selection of working correlation structure, we can compare results for different choices.
In  our  experience,  when  the  correlations  are  modest,  all  working  correlation  structures 
yield  similar  GEE  estimates  and  standard  errors,  as  the  empirical  dependence  has  a  large 
impact  on  adjusting  the  model-based  standard  errors.  If  they  differed  substantially,  a  more 
careful  study  of  the  correlation  structure  would  be  necessary.  Unless  we  expect  dramatic 
differences  among  the  correlations,  we  recommend  the  exchangeable  working  correlation 
structure.  This  recognizes  the  dependence  at  the  cost  of  only  one  extra  parameter. 

The GEE approach is appealing for categorical data because of its computational simplicity
compared with ML. Advantages include not requiring specification of a joint distribution
for $(Y_1, Y_2, \ldots, Y_T)$, and the consistency of estimation even with misspecified correlation
structure.  However,  it  has  limitations.  Since  the  GEE  approach  does  not  completely  specify 
the  joint  distribution,  it  does  not  have  a  likelihood  function.  Likelihood-based  methods  are 
not  available  for  testing  fit,  comparing  models,  and  conducting  inference  about  parameters, 
as  explained  in  Section  12.3.3.  In  fact,  some  statisticians  (e.g.,  Lindsey  and  Lambert  1999) 
are  critical  of  the  GEE  approach  because  of  the  lack  of  likelihood  or  possible  conflicts 
between  the  nature  of  subject-specific  effects  and  marginal  effects.  Others  do  not  find  this 
problematic,  as  they  regard  GEE  as  an  estimation  method  rather  than  a  model. 

12.2.2  Example:  Longitudinal  Mental  Depression  Revisited 

For  Table  12.1,  comparing  two  treatments  for  mental  depression,  in  Section  12.1.1  we  used 
ML  to  fit  a  logistic  model 


\[ \text{logit}[P(Y_t = 1)] = \alpha + \beta_1 s + \beta_2 d + \beta_3 t + \beta_4 (d \times t) \]

Table 12.5 Output from Using GEE to Fit Logistic Model to Table 12.1,
Using Exchangeable Working Correlation Structure

Analysis Of GEE Parameter Estimates

             Empirical Standard        Model-Based Standard
             Error Estimates           Error Estimates
Parameter    Estimate    Std Error     Estimate    Std Error
Intercept     -0.0281       0.1742      -0.0281       0.1638
severity      -1.3139       0.1460      -1.3139       0.1459
drug          -0.0593       0.2286      -0.0593       0.2222
time           0.4825       0.1199       0.4825       0.1150
drug*time      1.0172       0.1877       1.0172       0.1891

with drug $\times$ time interaction. The GEE analysis provides similar results, regardless of the
choice of working correlation structure. With the exchangeable structure, the estimated
common correlation between pairs of the three responses is $-0.003$. Pairs of successive
observations thus behave much like independent observations, which is unusual
for repeated measurement data. For this reason, similar results occur from fitting the model
by  treating  the  three  observations  for  a  subject  as  if  they  came  from  three  separate  subjects, 
that  is,  assuming  3  x  340  =  1020  independent  observations  rather  than  three  correlated 
observations  for  each  of  340  subjects. 

Table  12.5  shows  results  using  the  exchangeable  working  correlation  structure.  The 
empirical  standard  errors  incorporate  the  sample  dependence  to  adjust  the  exchangeable 
model-based  standard  errors.  Here,  there  is  not  much  change,  since  estimated  correlations 
in the unstructured case do not vary much (0.07 for times 1 and 2, $-0.03$ for times 1 and
3, and $-0.06$ for times 2 and 3).

The GEE estimated slope for the time effect (on the logit scale) for the standard drug is
0.4825, with empirical SE = 0.1199. For the new drug the slope increases by 1.0172, with
empirical SE = 0.1877, thus giving strong evidence of a faster rate of improvement.
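
The strength of this evidence can be gauged with a Wald statistic. The following Python sketch uses only the interaction estimate and its empirical standard error from Table 12.5; the helper name `wald_summary` is an illustrative assumption, not part of any package.

```python
from statistics import NormalDist

def wald_summary(estimate, se, level=0.95):
    """Return (z, two-sided p-value, (lower, upper)) for a Wald test of H0: parameter = 0."""
    z = estimate / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
    return z, p_value, (estimate - crit * se, estimate + crit * se)

# drug x time interaction from Table 12.5 (empirical standard error)
z, p_value, ci = wald_summary(1.0172, 0.1877)
print(f"z = {z:.2f}, p = {p_value:.2g}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The z statistic is about 5.42, and the interval for the interaction lies well above 0, consistent with the conclusion stated above.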

12.2.3  Example:  Multinomial  GEE  Approach  for  Insomnia  Trial 

Liang and Zeger (1986) originally specified the GEE methodology for modeling univariate
marginal distributions, such as the binomial and Poisson. It extends to marginal modeling
of multinomial responses. Lipsitz et al. (1994) outlined a GEE approach, illustrating
with cumulative logit and cumulative probit models. With this approach, for each pair of
outcome categories we select working correlations for the pairs of repeated observations.
Each multinomial response at a fixed observation uses the $(I - 1) \times (I - 1)$ multinomial
covariance matrix.

We illustrate for the insomnia data of Table 12.3, with $Y_t$ = time to fall asleep with
treatment $x$ at occasion $t$. In Section 12.1.3 we fitted the marginal cumulative logit model
(12.1), which is


\[ \text{logit}[P(Y_t \le j)] = \alpha_j + \beta_1 t + \beta_2 x + \beta_3 (t \times x), \]

using  ML.  Table  12.6  shows  results  using  the  GEE  approach  with  independence  working 
correlation structure (the only option with the SAS software employed). The GEE estimates
are  similar  to  the  ML  estimates  from  Section  12.1.3,  and  the  same  as  the  ML  estimates 

Table 12.6 GEE Results for Marginal Model for Insomnia Data

Analysis Of GEE Parameter Estimates

                  Empirical Standard        Model-Based Standard
                  Error Estimates           Error Estimates
Parameter         Estimate    Std Error     Estimate    Std Error
Intercept1         -2.2671       0.2188      -2.2671       0.2027
Intercept2         -0.9515       0.1809      -0.9515       0.1785
Intercept3          0.3517       0.1784       0.3517       0.1727
occasion            1.0381       0.1676       1.0381       0.2376
treat               0.0336       0.2384       0.0336       0.2369
treat*occasion      0.7078       0.2435       0.7078       0.3342

we’d  obtain  by  naively  treating  the  pairs  of  responses  as  independent,  that  is,  treating  the 
data  as  478  independent  observations  rather  than  as  matched-pairs  observations  for  239 
subjects.  There  is  a  positive  association  between  the  responses  at  the  two  times,  for  a 
given treatment, and the standard errors for the occasion effect for placebo (1.038) and the
treatment-by-occasion interaction (0.708) are overestimated by treating the observations
as independent. (Recall from Section 11.1.4 that positive dependence results in improved
precision  for  estimating  within-subject  effects.)  For  the  indicator  coding  used,  the  treatment 
effect  (0.0336)  refers  to  the  initial  time  only;  thus,  it  is  based  on  two  independent  samples, 
and  the  empirical  adjustment  has  essentially  no  effect. 

The substantive conclusions are the same as with ML fitting. Again, considerable evidence
exists that the distribution of time to fall asleep decreased more over time for the
treatment  group  than  for  the  placebo  group. 


12.3  QUASI-LIKELIHOOD  AND  ITS  GEE  MULTIVARIATE 
EXTENSION:  DETAILS 

A  GLM  assumes  a  certain  distribution  for  the  response  variable.  Sometimes  it  is  unclear 
how  to  select  it.  However,  often  there  is  a  plausible  relationship  between  the  mean  and 
variance, such as $v(\mu_i) = \phi\mu_i$ for count data. Then, an alternative to ML estimation is
quasi-likelihood  estimation  (Section  4.7).  We  next  present  some  details  about  this  method 
and  its  GEE  extension  for  marginal  modeling  of  multivariate  responses. 


12.3.1  The  Univariate  Quasi-likelihood  Method 

We begin with models for a univariate response. For subject $i$, $i = 1, \ldots, n$, let $y_i$ be the
outcome on $Y$ with $\mu_i = E(Y_i)$ and variance function $v(\mu_i)$, and let $x_{ij}$ be the value of
explanatory variable $j$. For link function $g$, the linear predictor is $\eta_i = g(\mu_i) = \sum_j \beta_j x_{ij} =
x_i^T \beta$. The quasi-likelihood (QL) parameter estimates $\hat\beta$ are the solutions of quasi-score
equations

\[ u(\beta) = \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T v(\mu_i)^{-1} (y_i - \mu_i) = 0, \tag{12.3} \]

where $\mu_i = g^{-1}(x_i^T \beta)$. These estimating equations are the same as the likelihood equations
(4.25) for GLMs when we substitute

\[ \frac{\partial \mu_i}{\partial \beta_j} = \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j}. \]

They are not likelihood equations, however, without the extra assumption that $\{y_i\}$ has
distribution in the natural exponential family. Under that assumption, $v(\mu_i)$ characterizes
the distribution within the natural exponential family (Jørgensen 1987). Another motivation
for equations (12.3) is that with $v(\mu_i)$ replaced by known variance $v_i$, they result from the
weighted least-squares problem of minimizing $\sum_i [(y_i - \mu_i)^2 / v_i]$.

The likelihood equations (4.25) for a GLM depend only on the mean and variance of $\{y_i\}$
and the link function $g$, which determines $\partial \mu_i / \partial \eta_i$. Thus, Wedderburn (1974) suggested
using them as estimating equations for any link and variance function, even if they do not
correspond to a particular member of the natural exponential family.
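
To make the quasi-score equations concrete, here is a minimal Python sketch that solves them by Fisher scoring in a deliberately simple case: a single-parameter model with log link, $\log \mu_i = \beta x_i$, and working variance $v(\mu_i) = \phi\mu_i$. The data, the function names, and the one-predictor simplification are all illustrative assumptions. The dispersion $\phi$ multiplies both the quasi-score and the information, so it cancels from the update and is omitted.

```python
import math

def quasi_score(beta, x, y):
    """Quasi-score (up to the factor 1/phi) for log(mu_i) = beta * x_i, v(mu) = phi * mu."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(beta * xi)
        dmu = xi * mu                    # d mu_i / d beta under the log link
        total += dmu * (yi - mu) / mu    # (d mu/d beta) * v(mu)^{-1} * (y - mu)
    return total

def fit_quasi_likelihood(x, y, beta=0.0, tol=1e-10, max_iter=50):
    """Fisher scoring: beta <- beta + u(beta) / information(beta)."""
    for _ in range(max_iter):
        # information = sum (d mu/d beta)^2 / v(mu), again up to 1/phi
        info = sum((xi * math.exp(beta * xi)) ** 2 / math.exp(beta * xi) for xi in x)
        step = quasi_score(beta, x, y) / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical counts y observed at predictor values x
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1, 3, 2, 6, 9, 12]
beta_hat = fit_quasi_likelihood(x, y)
print(beta_hat, quasi_score(beta_hat, x, y))   # the quasi-score is essentially 0 at beta_hat
```

Nothing in the code assumes that the counts are Poisson; only the mean model and the variance function enter the estimating equation.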

12.3.2  Properties  of  Quasi-likelihood  Estimators 

In the quasi-likelihood (QL) method, each element $u_j(\beta)$ of the quasi-score function in (12.3)
is called an unbiased estimating function; this term refers to any function $h(y; \beta)$ of $y$ and
$\beta$ such that $E[h(Y; \beta)] = 0$ for all $\beta$. The equations (12.3) that determine $\hat\beta$ are called
estimating equations.

The  quasi-likelihood  method  treats  the  quasi-score  function  as  the  derivative  of  a  function 
called  the  quasi-log  likelihood.  This  function  may  not  be  a  proper  log-likelihood  function. 
Nonetheless,  McCullagh  (1983)  showed  that  QL  estimators  have  properties  similar  to  those 
of  ML  estimators:  Under  correct  specification  of  the  mean  and  the  variance  function, 
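they are asymptotically efficient among estimators that are locally linear in $\{y_i\}$. This
result generalizes the Gauss-Markov theorem, although in an asymptotic rather than exact
manner. The QL estimators $\hat\beta$ are asymptotically normal with model-based covariance
matrix approximated by

\[ V = \left[ \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T [v(\mu_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \beta} \right) \right]^{-1}. \tag{12.4} \]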


This  is  equivalent  to  the  formula  for  the  large-sample  covariance  matrix  of  the  ML  estimator 
in  a  GLM  [which  is  estimated  by  (4.31)]. 

A key result is that the QL estimator $\hat\beta$ is consistent for $\beta$ (i.e., $\hat\beta \xrightarrow{p} \beta$) even if the
variance function is misspecified, as long as the specification is correct for the link function
and linear predictor. That is, assuming that the model form $g(\mu_i) = \sum_j \beta_j x_{ij}$ is correct,
the consistency of $\hat\beta$ holds even if the true variance function is not $v(\mu_i)$. We now give a
heuristic explanation for this.

When truly $\mu_i = g^{-1}(\sum_j \beta_j x_{ij})$, then from (12.3), $E[u_j(\beta)] = 0$ for all $j$. From (12.3),
$u(\beta)/n$ is a vector of sample means. By a law of large numbers, it converges in probability
to its expected value of 0. But the solution $\hat\beta$ of the quasi-score equations is the value of $\beta$
for which the sample mean is exactly equal to 0. Since $\hat\beta$ is a continuous function of these
sample means, it converges to $\beta$. The consistency also follows from general results for
unbiased estimating functions (see Note 12.4).




12.3.3  Sandwich  Covariance  Adjustment  for  Variance  Misspecification 

If we assume that $\mathrm{var}(Y_i) = v(\mu_i)$ but the true $\mathrm{var}(Y_i) \neq v(\mu_i)$, then the actual asymptotic
covariance matrix of the QL estimator $\hat\beta$ is not $V$ as given in (12.4). Instead, it is (Diggle
et al. 2002, White 1982)

\[ V \left[ \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T [v(\mu_i)]^{-1} \, \mathrm{var}(Y_i) \, [v(\mu_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \beta} \right) \right] V. \tag{12.5} \]

Even though the variances are scalar, we express the matrices in this form to motivate the
GEE multivariate extension discussed below. Matrix (12.5) simplifies to $V$ if $\mathrm{var}(Y_i) =
v(\mu_i)$. In practice, the true variance function is unknown. A consistent estimator of (12.5) is
a sample analog, replacing $\mu_i$ by $\hat\mu_i$ and $\mathrm{var}(Y_i)$ by $(y_i - \hat\mu_i)^2$ (Liang and Zeger 1986). The
estimated covariance matrix is valid regardless of whether the variance specification $v(\mu_i)$
is  correct.  This  estimated  covariance  is  called  a  sandwich  estimator,  because  the  empirical 
evidence  is  sandwiched  between  the  model-based  covariance  matrices. 

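To illustrate, for a sample of counts $\{y_i\}$, consider the common mean model, $\mu_i = \beta$,
$i = 1, \ldots, n$. Suppose we assume that $v(\mu_i) = \mu_i$, as in a Poisson model, but actually
$\mathrm{var}(Y_i) = \mu_i^2$. Since $\partial \mu_i / \partial \beta = 1$, from (12.3),

\[ u(\beta) = \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T v(\mu_i)^{-1} (y_i - \mu_i) = \sum_{i=1}^{n} \frac{(y_i - \mu_i)}{\mu_i} = \sum_{i=1}^{n} \frac{(y_i - \beta)}{\beta}. \]

Setting this equal to 0 and solving, $\hat\beta = (\sum_i y_i)/n = \bar{y}$. So, the model-based variance
(12.4) simplifies to

\[ V = \left[ \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T [v(\mu_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \beta} \right) \right]^{-1} = \left[ \sum_{i=1}^{n} \frac{1}{\mu_i} \right]^{-1} = \frac{\beta}{n}. \]

The actual asymptotic variance (12.5) of $\hat\beta = \bar{y}$ that takes into account the variance
misspecification is

\[ V \left[ \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T [v(\mu_i)]^{-1} \, \mathrm{var}(Y_i) \, [v(\mu_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \beta} \right) \right] V = V \left[ \sum_{i=1}^{n} \frac{1}{\mu_i} \, \mu_i^2 \, \frac{1}{\mu_i} \right] V. \]

In practice, we replace the true variance $\mu_i^2$ in this expression by $(y_i - \bar{y})^2$, so the last
expression simplifies (using $\mu_i = \beta$) to $\sum_i (y_i - \bar{y})^2 / n^2$, a quite sensible estimate of the
variance of a sample mean. Using this sandwich estimator instead of $\hat{V} = \hat\beta/n = \bar{y}/n$
protects against an incorrect choice of variance function.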
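
The two variance estimates in this example are easy to compare numerically. The following Python sketch does so for a small set of hypothetical, visibly overdispersed counts; the data and the function name are illustrative assumptions.

```python
def common_mean_variances(y):
    """Model-based (Poisson working variance) and sandwich variance estimates for beta_hat = ybar."""
    n = len(y)
    ybar = sum(y) / n
    model_based = ybar / n                                   # V-hat = beta_hat / n
    sandwich = sum((yi - ybar) ** 2 for yi in y) / n ** 2    # sum (y_i - ybar)^2 / n^2
    return ybar, model_based, sandwich

# Hypothetical counts whose spread far exceeds what a Poisson model with mean 6 allows
y = [0, 0, 1, 1, 2, 3, 5, 8, 13, 27]
ybar, model_based, sandwich = common_mean_variances(y)
print(ybar, model_based, sandwich)   # 6.0, then 0.6 and 6.42 up to floating-point rounding
```

For these data the Poisson working variance gives an estimated variance of 0.6 for $\bar{y}$, whereas the sandwich estimate is 6.42, so the model-based standard error would be too small by a factor of about 3.3.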

The  purpose  of  the  sandwich  estimator  is  to  use  the  data’s  empirical  evidence  about 
variation  to  adjust  the  standard  errors  in  case  the  true  variance  differs  substantially  from  the 
working  guess.  Inference  uses  the  asymptotic  normality  of  the  estimators  together  with  the 
sandwich-estimated covariance matrix. A significance test of $H_0$: $\beta_j = 0$ using test statistic
$z = \hat\beta_j / SE$ (or its square) and a 95% confidence interval $\hat\beta_j \pm 1.96(SE)$ provide Wald-type
inference.
In summary, even with incorrect specification of the variance function, we can still
consistently estimate $\beta$ and estimate the asymptotic covariance of $\hat\beta$ by estimating the
sandwich adjustment (12.5). However, some efficiency loss occurs when the variance
chosen, $v(\mu_i)$, is wildly inaccurate. Also, the number of clusters $n$ may need to be large
for the sample version of (12.5) to work well; otherwise, the empirically based standard
errors tend to underestimate the true ones (e.g., Firth 1993b). As estimators, those standard
errors can also show more variability than parametric estimators (Kauermann and Carroll
2001). Boos (1992) proposed analogs of score tests that solve the GEE under the restriction
that  the  null  holds  and  which  incorporate  empirical  variance  estimates,  illustrating  with 
tests  for  trend  and  lack  of  fit  in  binary  regression.  See  also  Lefkopoulou  et  al.  (1989)  and 
Rotnitzky  and  Jewell  (1990).  Finally,  in  practice  we  must  recognize  that  just  as  the  variance 
function  chosen  only  approximates  the  true  one,  so  is  the  specification  for  the  mean  only 
approximate. 


12.3.4  GEE  Multivariate  Methodology:  Technical  Details 

Now  we  consider  the  generalized  estimating  equations  (GEE)  multivariate  version  of  QL. 

For cluster $i$, let $y_i = (y_{i1}, \ldots, y_{iT_i})^T$ and $\mu_i = (\mu_{i1}, \ldots, \mu_{iT_i})^T$, where $\mu_{it} = E(Y_{it})$. The
number $T_i$ of responses may vary by cluster. Let $x_{it}$ denote a $p \times 1$ vector of explanatory
variable values for $y_{it}$. The notation also allows values of the explanatory variables to vary
for the observations in a cluster. The linear predictor of the model is $\eta_{it} = g(\mu_{it}) = x_{it}^T \beta$
for link function $g$. The model refers to the marginal distribution at each $t$ rather than the
joint distribution. Let $X_i$ be the $T_i \times p$ matrix of explanatory variable values for cluster $i$,
for which row $t$ is $x_{it}^T$.

We assume that $y_{it}$ has probability mass function of form

\[ f(y_{it}; \theta_{it}, \phi) = \exp\{ [y_{it}\theta_{it} - b(\theta_{it})]/\phi + c(y_{it}, \phi) \}. \]

When $\phi$ is known, this is the natural exponential family with natural parameter $\theta_{it}$. From
Section 4.4.2,

\[ \mu_{it} = E(Y_{it}) = b'(\theta_{it}), \qquad v(\mu_{it}) = \mathrm{var}(Y_{it}) = b''(\theta_{it})\phi. \]

The GEE method also assumes a working correlation matrix $R(\alpha)$ for $Y_i$, depending on
parameters $\alpha$. The exchangeable working correlation has $\mathrm{corr}(Y_{it}, Y_{is}) = \alpha$ for each pair
in $Y_i$. Let $b_i(\theta) = (b(\theta_{i1}), \ldots, b(\theta_{iT_i}))$, and let $B_i$ denote a diagonal matrix with
main-diagonal elements $b''(\theta_{it})$. Then the working covariance matrix for $Y_i$ is

\[ V_i = B_i^{1/2} R(\alpha) B_i^{1/2} \phi. \tag{12.6} \]

If $R$ is the true correlation matrix for $Y_i$, then $V_i = \mathrm{cov}(Y_i)$.
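
As a small illustration of the working covariance matrix, the following Python sketch builds $V_i$ for binary responses under an exchangeable working correlation. For binary data $\phi = 1$ and $b''(\theta_{it}) = \mu_{it}(1 - \mu_{it})$. The marginal means, the value of $\alpha$, and the function names are hypothetical.

```python
import math

def exchangeable_corr(T, alpha):
    """T x T working correlation matrix with alpha in every off-diagonal position."""
    return [[1.0 if t == s else alpha for s in range(T)] for t in range(T)]

def working_covariance(mu, alpha, phi=1.0):
    """V_i = B^{1/2} R(alpha) B^{1/2} phi for binary responses, with B = diag(mu_t (1 - mu_t))."""
    sd = [math.sqrt(m * (1 - m)) for m in mu]        # diagonal of B^{1/2}
    R = exchangeable_corr(len(mu), alpha)
    T = len(mu)
    return [[phi * sd[t] * R[t][s] * sd[s] for s in range(T)] for t in range(T)]

# Hypothetical marginal means at three occasions, working correlation 0.3
V = working_covariance([0.2, 0.5, 0.8], alpha=0.3)
for row in V:
    print([round(v, 4) for v in row])
```

The diagonal holds the binomial variances (0.16, 0.25, 0.16 here), and each off-diagonal entry is $\alpha$ times the product of the two standard deviations.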

Let $\Delta_i$ be the diagonal matrix with elements $\partial \theta_{it} / \partial \eta_{it}$ on the main diagonal for
$t = 1, \ldots, T_i$. (For the canonical link, this is the identity matrix.) Let $D_i = \partial \mu_i / \partial \beta =
B_i \Delta_i X_i$ be a $T_i \times p$ matrix with typical element expressing $\partial \mu_{it} / \partial \beta_j$ in the form
$(\partial \mu_{it} / \partial \theta_{it})(\partial \theta_{it} / \partial \eta_{it})(\partial \eta_{it} / \partial \beta_j)$. From (12.3), for univariate GLMs the quasi-likelihood
estimating equations have the form

\[ \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \beta} \right)^T v(\mu_i)^{-1} [y_i - \mu_i(\beta)] = 0, \]

where $\mu_i = \mu_i(\beta) = g^{-1}(x_i^T \beta)$. The analog of this in the multivariate case is the set of
generalized estimating equations

\[ \sum_{i=1}^{n} D_i^T V_i^{-1} [y_i - \mu_i(\beta)] = 0. \tag{12.7} \]

The GEE estimator $\hat\beta$ is the solution of these equations.

The approach that sets $R(\alpha) = I$ treats clustered responses as independent. In that case,
(12.6) simplifies to $V_i = B_i \phi$, and the generalized estimating equations simplify to

\[ \sum_{i=1}^{n} D_i^T V_i^{-1} [y_i - \mu_i(\beta)] = \sum_{i=1}^{n} X_i^T \Delta_i B_i (B_i \phi)^{-1} [y_i - \mu_i(\beta)] = (1/\phi) \sum_{i=1}^{n} X_i^T \Delta_i [y_i - \mu_i(\beta)] = 0. \]

The solution $\hat\beta$ is then the same as the ordinary ML estimator for a GLM with the chosen
link function and variance function, treating $(y_{i1}, \ldots, y_{iT_i})$ as independent observations.

Normally,  we  select  a  working  correlation  matrix  permitting  dependence,  such  as  the 
exchangeable  structure.  Liang  and  Zeger  (1986)  suggested  computing  the  GEE  estimates  by 
iterating  between  a  modified  Fisher  scoring  algorithm  for  solving  the  generalized  estimating 
equations for $\beta$ (given current estimates of $\alpha$ and $\phi$) and using residuals for moment
estimation of $\alpha$ and $\phi$ (based on the current estimates of $\beta$). They suggested estimates of
$R(\alpha)$ for a variety of correlation structures. Alternative algorithms simultaneously solve
estimating equations for $\beta$ and for association parameters. See Liang et al. (1992) and Note
12.5.
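
The moment-estimation half of this iteration can be sketched in a few lines of Python. The version below takes Pearson residuals computed at the current $\hat\beta$ and returns a dispersion estimate and an exchangeable correlation estimate. It is a sketch of one common form of these moment estimators; the exact denominators (here with a degrees-of-freedom correction for $p$ regression parameters) are an assumption and vary across software, and the residual values are hypothetical.

```python
def moment_estimates(residuals, p):
    """Moment estimates of dispersion phi and exchangeable correlation alpha.

    residuals: list of clusters, each a list of Pearson residuals
               (y_it - mu_it) / sqrt(b''(theta_it)) evaluated at the current beta.
    p: number of regression parameters (used as a degrees-of-freedom correction).
    """
    n_obs = sum(len(r) for r in residuals)
    phi = sum(v * v for r in residuals for v in r) / (n_obs - p)
    cross = 0.0      # sum of products of residuals over within-cluster pairs
    n_pairs = 0
    for r in residuals:
        for t in range(len(r)):
            for s in range(t + 1, len(r)):
                cross += r[t] * r[s]
                n_pairs += 1
    alpha = cross / (phi * (n_pairs - p))
    return phi, alpha

# Hypothetical Pearson residuals for four clusters of unequal size
residuals = [[1.0, 0.8, 1.2], [-0.9, -1.1], [0.5, 0.7, 0.2], [-1.0, -0.6, -0.8]]
phi_hat, alpha_hat = moment_estimates(residuals, p=1)
print(phi_hat, alpha_hat)
```

In a full fitting routine these estimates would be plugged into the working covariance matrix before the next scoring step for $\beta$.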

Liang  and  Zeger  (1986)  showed  asymptotic  normality  and  consistency  as  the  number  of 
clusters  n  increases.  Under  certain  regularity  conditions  including  appropriate  consistency 
for estimates of $\alpha$ and $\phi$,

\[ \sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N(0, V_G). \]

Here, generalizing (12.5), $V_G = \lim_{n \to \infty} V_{G,n}$ with

\[ V_{G,n} = n \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right]^{-1} \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} \, \mathrm{cov}(Y_i) \, V_i^{-1} D_i \right] \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right]^{-1}. \]

The estimated sandwich covariance matrix $(1/n)\hat{V}_{G,n}$ of $\hat\beta$ replaces $\beta$ with $\hat\beta$, $\phi$ with $\hat\phi$, $\alpha$
with $\hat\alpha$, and $\mathrm{cov}(Y_i)$ by $[y_i - \mu_i(\hat\beta)][y_i - \mu_i(\hat\beta)]^T$.

When the working correlation structure is the true one and $\mathrm{cov}(Y_i) = V_i$, the asymptotic
covariance matrix $(1/n)V_{G,n}$ simplifies to the model-based covariance matrix,
$(\sum_i D_i^T V_i^{-1} D_i)^{-1}$. This is the relevant covariance matrix if we put complete faith in
our choice of the correlation structure.
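
A special case in which every matrix reduces to a scalar may help in reading the sandwich formula. The Python sketch below assumes clustered binary data, a common success probability $\mu_{it} = \beta$ with identity link, and independence working correlation, so that $D_i$ is a vector of 1s and $V_i = \beta(1 - \beta)I$. The data and the function name are hypothetical.

```python
def gee_common_probability(clusters):
    """Independence-working-correlation GEE for the model mu_it = beta (identity link).

    Returns beta_hat, the model-based variance, and the sandwich variance of beta_hat.
    """
    n_obs = sum(len(c) for c in clusters)
    beta_hat = sum(sum(c) for c in clusters) / n_obs
    v = beta_hat * (1 - beta_hat)            # working variance of each binary response
    bread = n_obs / v                        # sum_i D_i^T V_i^{-1} D_i
    # sum_i D_i^T V_i^{-1} (y_i - mu_i)(y_i - mu_i)^T V_i^{-1} D_i
    meat = sum(sum(y - beta_hat for y in c) ** 2 for c in clusters) / v ** 2
    return beta_hat, 1 / bread, meat / bread ** 2

# Hypothetical clusters of three binary responses with strong within-cluster similarity
clusters = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
beta_hat, model_var, sandwich_var = gee_common_probability(clusters)
print(beta_hat, model_var ** 0.5, sandwich_var ** 0.5)
```

Because responses within a cluster tend to agree, the sandwich variance (about 0.029) is roughly twice the model-based variance (about 0.014) that treats the 18 observations as independent. With only six clusters the empirical estimate is itself quite variable, as the cautions in Section 12.3.3 suggest.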


12.3.5  Working  Associations  Characterized  by  Odds  Ratios 

With binary data, the correlation is not the best way to characterize within-cluster association.
The marginal probabilities constrain the possible correlation values, since the range of
possible values for $E(Y_{it}Y_{is}) = P(Y_{it} = 1, Y_{is} = 1)$ depends on $P(Y_{it} = 1)$ and $P(Y_{is} = 1)$.
In particular, it is not possible to have a strong correlation when the marginal probabilities
vary greatly. Also, the correlations typically depend on values of explanatory variables.
Although a structure such as exchangeable correlations may be plausible for underlying
continuous latent variables, it is usually not so for observed binary response variables. In
some cases, working correlation matrices may not have valid joint distributions with that
structure, and there can be a breakdown in the asymptotic properties of the correlation
and model parameter estimators; see Chaganty and Joe (2004, 2006), Crowder (1995), and
Touloumis (2011).
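
The constraint that the margins place on the correlation can be computed directly. Since $\max(0, p_1 + p_2 - 1) \le P(Y_1 = 1, Y_2 = 1) \le \min(p_1, p_2)$ for binary variables with success probabilities $p_1$ and $p_2$, the following Python sketch (function name hypothetical) returns the attainable range of the correlation.

```python
import math

def correlation_bounds(p1, p2):
    """Attainable range of corr(Y1, Y2) for binary Y1, Y2 with P(Y1=1)=p1 and P(Y2=1)=p2."""
    scale = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    p11_min = max(0.0, p1 + p2 - 1)   # smallest feasible P(Y1=1, Y2=1)
    p11_max = min(p1, p2)             # largest feasible P(Y1=1, Y2=1)
    return (p11_min - p1 * p2) / scale, (p11_max - p1 * p2) / scale

print(correlation_bounds(0.5, 0.5))   # equal margins: the full range from -1 to 1
print(correlation_bounds(0.1, 0.9))   # very different margins: upper bound near 0.11
```

With marginal probabilities 0.1 and 0.9, no joint distribution has correlation above about 0.11, so a working correlation of, say, 0.5 could not correspond to any valid joint distribution for that pair.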

An alternative approach uses the odds ratio to characterize pairwise associations. For
instance, we can model the log odds ratios for pairs of observations in a cluster as exchangeable,
and use the odds ratios together with the marginal probabilities to specify working
correlation matrices. This approach also has the advantages that the association parameters
are distinct from the means and that the working correlations can depend on values of
explanatory variables. See Fitzmaurice et al. (1993) and Lipsitz et al. (1991).
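
To show how an odds ratio and two marginal probabilities determine a working correlation, the Python sketch below solves the quadratic equation that the joint probability $P(Y_1 = 1, Y_2 = 1)$ must satisfy for a given odds ratio $\theta$, and then converts it to a correlation. The function names and numerical inputs are illustrative assumptions.

```python
import math

def joint_prob_from_odds_ratio(p1, p2, theta):
    """P(Y1=1, Y2=1) implied by marginal probabilities p1, p2 and odds ratio theta."""
    if abs(theta - 1.0) < 1e-12:
        return p1 * p2                               # independence
    a = 1 + (p1 + p2) * (theta - 1)
    disc = a * a - 4 * theta * (theta - 1) * p1 * p2
    return (a - math.sqrt(disc)) / (2 * (theta - 1))  # the root that yields valid cell probabilities

def correlation_from_odds_ratio(p1, p2, theta):
    """Correlation between two binary responses with the given margins and odds ratio."""
    p11 = joint_prob_from_odds_ratio(p1, p2, theta)
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

def odds_ratio(p1, p2, p11):
    """Odds ratio of the 2 x 2 table with margins p1, p2 and cell P(Y1=1, Y2=1) = p11."""
    return p11 * (1 - p1 - p2 + p11) / ((p1 - p11) * (p2 - p11))

# The same odds ratio implies different correlations at different margins
for p1, p2 in [(0.5, 0.5), (0.3, 0.6), (0.1, 0.9)]:
    print(p1, p2, round(correlation_from_odds_ratio(p1, p2, theta=4.0), 3))
```

Unlike a correlation, any positive odds ratio is compatible with any pair of marginal probabilities, which is one reason the odds ratio is attractive as the association parameter.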

Carey  et  al.  (1993)  suggested  an  iterative  alternating  logistic  regressions  algorithm.  It 
alternates  between  a  GEE  step  for  the  regression  parameters  in  the  model  for  the  mean  and 
a  step  for  an  association  model  for  the  log  odds  ratio.  This  is  especially  useful  when  the 
structure  of  the  association  is  itself  a  major  focus  rather  than  a  nuisance,  as  the  method  also 
provides  standard  errors  for  the  estimates  for  the  association  model. 

We  illustrate  for  the  GEE  analysis  of  the  depression  study  of  Table  12.1.  Using  an 
exchangeable odds ratio for pairs of times, we obtain a common log odds ratio estimate
of $-0.007$ with SE = 0.162. The parameter estimates and standard errors are the same to
three  decimal  places  as  in  the  correlation-based  analysis  of  Section  12.2.2. 


12.3.6  GEE  Approach:  Multinomial  Responses 

Lipsitz et al. (1994) developed the GEE approach for marginal modeling with a multinomial
response. Let $y_{it}(j) = 1$ if observation $t$ in cluster $i$ has outcome $j$ ($j =
1, \ldots, I - 1$). Let $y_i$ be the $T_i(I - 1)$ binary indicators for cluster $i$. Then, the chosen
$[T_i(I - 1)] \times [T_i(I - 1)]$ working covariance matrix $V_i$ for $y_i$ specifies a pattern
for $\mathrm{corr}(Y_{it}(j), Y_{is}(k))$ for each pair of outcome categories $(j, k)$ and each pair
$(t, s)$. The $(I - 1) \times (I - 1)$ block of $V_i$ for $(y_{it}(1), \ldots, y_{it}(I - 1))$ is a multinomial
covariance matrix with $v_{it}(j) = P(Y_{it}(j) = 1)[1 - P(Y_{it}(j) = 1)]$ on the main diagonal
and $-P(Y_{it}(j) = 1)P(Y_{it}(k) = 1)$ off it. The remaining elements of $V_i$ contain
elements $\mathrm{cov}(Y_{it}(j), Y_{is}(k))$. For instance, one possibility is the exchangeable structure,
$\mathrm{corr}(Y_{it}(j), Y_{is}(k)) = \rho_{jk}$ for all $t$ and $s$. An alternative approach specifies working
associations based on multinomial odds ratios such as the set of local odds ratios (Touloumis
2011).

The generalized estimating equations for $\beta$ again have the form

\[ u(\beta) = \sum_{i=1}^{n} D_i^T V_i^{-1} (y_i - \mu_i) = 0, \]

where $\mu_i$ is the vector of probabilities associated with $y_i$, and $D_i = \partial \mu_i / \partial \beta$. Lipsitz et al.
(1994) suggested a Fisher scoring algorithm for solving these equations and a method of
moments update for estimating $\{\rho_{jk}\}$ at each step of the iteration. An empirically adjusted
sandwich covariance matrix of $\hat\beta$ is again

\[ \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right]^{-1} \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} \, \mathrm{cov}(Y_i) \, V_i^{-1} D_i \right] \left[ \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right]^{-1}. \]

This is estimated by substituting $\hat\mu_i$ from the model fit and replacing $\mathrm{cov}(Y_i)$ by the
empirical covariance matrix of $y_i$.

12.3.7  Dealing  with  Missing  Data 

Unfortunately,  studies  with  repeated  measurements  often  have  cases  for  which  at  least  one 
response  in  a  cluster  is  missing.  In  a  longitudinal  study,  for  instance,  some  subjects  may  drop 
out before its conclusion. This often results in a monotone missingness pattern, in which if
$y_t$ was observed then so necessarily was $y_{t-1}$, and if $y_t$ was missing then so necessarily was
$y_{t+1}$. When data are missing, biased estimates can result from either analyzing the observed
data  as  if  no  data  are  missing  or  from  analyzing  the  data  file  that  deletes  entire  clusters  that 
have  at  least  some  missingness. 

We partition the data $Y$ into those that are observed, $Y^{(o)}$, and those that are missing,
$Y^{(m)}$. Let $M$ denote a vector of missing data indicators that equal 1 when an observation
is missing and 0 otherwise. Little and Rubin (2002) called the data missing completely
at random (MCAR) if $M$ is statistically independent of $Y$; that is, the probability that an
observation is missing is independent of that observation's value and the values of other
variables in the data set. In this case, $Y^{(m)}$ behaves like a random sample from $Y$. Less
restrictively, they called the data missing at random (MAR) if the distribution of $(M \mid Y)$
equals that of $(M \mid Y^{(o)})$; that is, what caused the data to be missing does not depend on the
data  itself.  For  example,  in  a  longitudinal  study,  if  whether  someone  drops  out  of  the  study 
depends  on  values  observed  prior  to  the  drop-out  but  not  the  later  unobserved  values,  the 
data  are  MAR. 

Consider a study such as in Section 12.1.1, modeling depression as a function of treatment,
severity, and time. If the probability that the depression assessment is missing is the
same  for  all  subjects  regardless  of  treatment,  severity,  and  time,  then  the  data  are  MCAR. 
If  the  probability  the  depression  assessment  is  missing  varies  according  to  time  but  does 
not  vary  according  to  the  depression  assessment  of  subjects  at  the  same  treatment,  severity, 
and  time,  then  the  data  are  not  MCAR  but  are  MAR.  They  are  not  MAR  if  those  with  a 
missing  depression  assessment  tend  to  have  worse  depression  than  those  not  missing  the 
assessment,  controlling  for  the  other  variables. 
In practice, we cannot test whether MCAR or MAR is satisfied, because we do not
know the values of the missing data. However, certain evidence can show that they are
not  satisfied.  For  example,  suppose  the  subjects  classified  as  severe  in  their  depression 
symptoms  tended  to  be  much  more  likely  to  have  missing  depression  observations  than 
those  classified  as  mild.  Then,  the  missing  data  do  not  seem  to  be  MCAR,  because  the 
missing  observations  do  not  resemble  a  random  sample  of  all  the  observations. 

When either MCAR or MAR is plausible, with a likelihood-based analysis it is not necessary
to model the missingness mechanism. An analysis using only $Y^{(o)}$ is not systematically
biased. For ML fitting, we treat the missing data as random variables to be integrated out of
the likelihood function using the EM algorithm (Section 13.6.3). Exercises 12.12 and 12.29
illustrate how this can be done simply for a contingency table having monotone missingness.
An alternative to ML uses multiple imputation. This is a Monte Carlo method in which
the  missing  values  are  replaced  several  times  by  simulated  versions  from  their  conditional 
distribution,  given  the  observed  data  (see  Rubin  1996).  Each  simulated  complete  data  set  is 
analyzed  with  standard  methods,  and  the  results  are  then  combined.  The  resulting  estimates 
have  standard  errors  based  on  the  within-imputation  and  between-imputation  variances, 
thus  incorporating  the  missing-data  uncertainty.  This  method  is  most  naturally  applied  in  a 
Bayesian  context  that  also  treats  parameters  as  random. 
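
The combining step of multiple imputation is simple enough to sketch in Python. The version below uses the standard combining rule usually attributed to Rubin, in which the total variance is the average within-imputation variance plus $(1 + 1/m)$ times the between-imputation variance. The estimates, standard errors, and function name are hypothetical.

```python
from statistics import mean, variance

def pool_imputations(estimates, variances):
    """Combine estimates and squared standard errors from m imputed data sets."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)              # average within-imputation variance
    between = variance(estimates)         # sample variance of the m estimates
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical estimates and standard errors from m = 5 imputed data sets
estimates = [1.02, 0.95, 1.10, 0.99, 1.04]
std_errors = [0.19, 0.20, 0.18, 0.19, 0.21]
pooled, pooled_se = pool_imputations(estimates, [se ** 2 for se in std_errors])
print(pooled, pooled_se)
```

The pooled standard error (about 0.204 here) exceeds the typical single-imputation standard error because it adds the variability between imputations, which reflects the uncertainty due to the missing data.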

With  the  GEE  method,  different  clusters  can  have  different  numbers  of  observations. 
The  data  input  file  has  a  separate  line  for  each  observation,  and  for  longitudinal  studies, 
computations use those times for which a subject has an observation. However, bias can
arise in GEE estimates when data are missing unless the data are MCAR. In the MCAR case, the
missingness can be ignored and an analysis using the observed data only is valid. In the MAR case,
it is valid when estimating equations can be weighted by response probabilities (Robins
et  al.  1995).  Otherwise,  however,  with  GEE  and  other  non-likelihood-based  methods,  the 
missingness  process  cannot  be  ignored  even  in  the  MAR  case.  Kenward  et  al.  (1994) 
illustrated  the  potential  breakdown  in  GEE  estimates  when  the  data  are  not  MCAR. 
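
A small simulation can illustrate both the bias and the effect of weighting by response probabilities. The Python sketch below generates two binary responses per subject, lets the second be observed with a probability that depends only on the first (so the data are MAR but not MCAR), and compares the unweighted observed-case mean of the second response with an inverse-probability-weighted mean. All numerical settings are hypothetical, and the response probabilities are treated as known, whereas in practice they would have to be estimated.

```python
import random

def simulate_dropout(n=100_000, seed=1):
    """Observed-case mean of Y2 versus an inverse-probability-weighted mean under MAR dropout."""
    rng = random.Random(seed)
    obs_sum = obs_n = 0.0
    w_sum = w_total = 0.0
    for _ in range(n):
        y1 = 1 if rng.random() < 0.5 else 0
        y2 = 1 if rng.random() < (0.8 if y1 else 0.2) else 0
        pi = 0.9 if y1 else 0.3           # P(Y2 observed | y1): depends only on the observed y1
        if rng.random() < pi:             # subject stays in the study at time 2
            obs_sum += y2
            obs_n += 1
            w_sum += y2 / pi              # weight each observed response by 1 / pi
            w_total += 1 / pi
    return obs_sum / obs_n, w_sum / w_total

naive, weighted = simulate_dropout()
print(naive, weighted)   # the true P(Y2 = 1) in this simulation is 0.5
```

In this setup the unweighted mean converges to 0.65 rather than 0.5, because subjects with $y_1 = 1$ are overrepresented among those still observed, while the weighted mean is close to 0.5.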

Often,  missingness  is  not  MCAR  or  MAR  but  rather  is  informative  and  cannot  be 
ignored.  For  instance,  in  a  longitudinal  study  measuring  pain,  perhaps  a  subject  dropped 
out when the pain got above some threshold. Then, more complex analyses are needed
that  model  the  missingness  as  well  as  the  complete  data.  That  is,  methods  require  a  joint 
distribution for $Y$ and $M$ (Little 2005). Let $f(\cdot)$ denote a generic probability mass function,
which also depends on explanatory variables $x$ and parameters. Selection models factor the
joint distribution of $Y$ and $M$ as

\[ f(y, M; x, \beta, \psi) = f(y; x, \beta) f(M \mid y; x, \psi), \]

where $f(y; x, \beta)$ is the model in the absence of missing values and $f(M \mid y; x, \psi)$ is the
model for the missing-data mechanism, such as a logistic model for $M$ that selects dropouts
according to their history of previous responses. Pattern mixture models use the alternative
factorization

\[ f(y, M; x, \phi, \theta) = f(y \mid M; x, \phi) f(M; x, \theta), \]

which conditions the distribution of $Y$ on the missing data pattern. The two specifications
are equivalent when $M$ is independent of $Y$, with $\beta = \phi$ and $\psi = \theta$. For discussion of
advantages  of  each  modeling  approach  and  details  on  ways  of  modeling  missingness,  see 
Little  (2005)  and  references  in  Note  12.6. 
Analyses in the presence of much missingness should be made with caution. Typically,
little  is  known  about  the  missing  data  mechanism,  and  assumptions  about  it  cannot  be 
checked.  Since  inferences  may  not  be  robust,  a  sensitivity  study  is  necessary  to  check 
how  results  depend  on  specification  of  that  mechanism.  In  the  absence  of  a  model  for  the 
missingness,  we  should  at  least  compare  results  of  the  analysis  using  all  available  cases  for 
all  clusters  to  the  analysis  using  only  clusters  having  no  missing  observations.  If  results 
differ  substantially,  conclusions  should  be  very  tentative  until  the  reasons  for  missingness 
can  be  studied. 


12.4  TRANSITIONAL  MODELS:  MARKOV  CHAIN  AND 
TIME  SERIES  MODELS 

For responses $Y_t$, $t = 0, 1, 2, \ldots$, the indexed family of random variables
$(Y_0, Y_1, Y_2, \ldots)$ is called a stochastic process. Its state space is the set of possible values
for $Y_t$. When the state space is categorical, $\{Y_t\}$ has discrete state space. Often, the main
focus is on the dependence of $Y_t$ on the other responses as well as any explanatory variables.
Models of this type are called transitional models, as they describe the transition to the state
for $Y_t$ from the states at other observations. When $t$ is a time index and the dependence of
$Y_t$ on other responses is modeled solely in terms of responses $\{y_0, y_1, \ldots, y_{t-1}\}$ observed
previously, the model is referred to as a time series model.

Let $f(y_0, \ldots, y_T)$ denote the joint probability mass function of $(Y_0, \ldots, Y_T)$ (ignoring,
for now, explanatory variables). Transitional models for time series data use the factorization

$$f(y_0, \ldots, y_T) = f(y_0)\, f(y_1 \mid y_0)\, f(y_2 \mid y_0, y_1) \cdots f(y_T \mid y_0, y_1, \ldots, y_{T-1}).$$

Unlike  the  marginal  models  in  the  other  sections  of  this  chapter,  this  modeling  is  conditional 
on  previous  responses. 


12.4.1  Markov  Chains 

A Markov chain is a simple stochastic process having discrete state space for which, for
all $t$, the conditional distribution of $Y_t$, given $Y_0, \ldots, Y_{t-1}$, is identical to the conditional
distribution of $Y_t$ given $Y_{t-1}$ alone. That is, given $Y_{t-1}$, $Y_t$ is conditionally independent of
$Y_0, \ldots, Y_{t-2}$. Knowing the present state of a Markov chain, information about past states
does not help us predict future states. For Markov chains,

$$f(y_0, \ldots, y_T) = f(y_0)\, f(y_1 \mid y_0)\, f(y_2 \mid y_1) \cdots f(y_T \mid y_{T-1}). \qquad (12.8)$$

Many transitional models have Markov chain structure.

For a Markov chain, denote the conditional probability $P(Y_t = j \mid Y_{t-1} = i)$ by $\pi_{j|i}(t)$.
The $\{\pi_{j|i}(t)\}$, which satisfy $\sum_j \pi_{j|i}(t) = 1$, are called transition probabilities. The $I \times I$
matrix $\{\pi_{j|i}(t),\ i = 1, \ldots, I,\ j = 1, \ldots, I\}$ is a transition probability matrix. From (12.8),
the joint distribution for a Markov chain depends only on one-step transition probabilities
and the marginal distribution for the initial state $Y_0$. It follows that the joint distribution
satisfies the loglinear model

$$(Y_0Y_1,\ Y_1Y_2,\ \ldots,\ Y_{T-1}Y_T).$$



For  a  sample  of  realizations  of  a  discrete-time  Markov  chain,  a  contingency  table  displays 
counts  of  the  possible  sequences.  A  test  of  fit  of  this  loglinear  model  checks  whether  the 
process  plausibly  satisfies  the  Markov  property. 

Statistical inference for Markov chains uses standard methods of categorical data
analysis. For example, consider ML estimation of transition probabilities. Let $n_{ij}(t)$ denote the
number of transitions from state $i$ at time $t - 1$ to state $j$ at time $t$. For fixed $t$, $\{n_{ij}(t)\}$
form the two-way marginal table for dimensions $t - 1$ and $t$ of an $I^{T+1}$ contingency table.
For the $n_{i+}(t)$ subjects in category $i$ at time $t - 1$, suppose that $\{n_{ij}(t),\ j = 1, \ldots, I\}$ have
a multinomial distribution with parameters $\{\pi_{j|i}(t)\}$. Let $\{n_{i0}\}$ denote the initial counts.
Suppose that they also have a multinomial distribution, with parameters $\{\pi_{i0}\}$. If subjects
behave independently, from (12.8) the likelihood function is proportional to

$$\left(\prod_{i=1}^{I} \pi_{i0}^{\,n_{i0}}\right) \left\{\prod_{t=1}^{T} \left[\prod_{i=1}^{I} \left(\prod_{j=1}^{I} \pi_{j|i}(t)^{\,n_{ij}(t)}\right)\right]\right\}. \qquad (12.9)$$


The transition probabilities are parameters of $IT$ independent multinomial distributions.
From Anderson and Goodman (1957), the ML estimates are

$$\hat{\pi}_{j|i}(t) = n_{ij}(t)/n_{i+}(t).$$

Many models assume that the transition probabilities are stationary. That is, for all $i$ and $j$,

$$\pi_{j|i}(1) = \pi_{j|i}(2) = \cdots = \pi_{j|i}(T) = \pi_{j|i}.$$

Let $n_{ij} = \sum_t n_{ij}(t)$. Under the assumption of stationary transition probabilities, the
likelihood in (12.9) simplifies, and the ML estimators are

$$\hat{\pi}_{j|i} = n_{ij}/n_{i+}.$$
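
The stationary estimator is just a row proportion of pooled transition counts. As a minimal sketch, assuming each subject's responses are stored as a list of state labels (the function name `stationary_transition_mle` and the three short sequences below are hypothetical, chosen only for illustration):

```python
from collections import Counter

def stationary_transition_mle(sequences, states):
    """Pool one-step transition counts n_ij over all subjects and times,
    then return the counts and the row proportions n_ij / n_i+."""
    counts = Counter()
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[(prev, curr)] += 1
    probs = {}
    for i in states:
        row_total = sum(counts[(i, j)] for j in states)
        for j in states:
            # A state never occupied before a transition has no estimate.
            probs[(i, j)] = counts[(i, j)] / row_total if row_total else float("nan")
    return counts, probs

# Hypothetical binary sequences for three subjects.
sequences = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
counts, probs = stationary_transition_mle(sequences, states=[0, 1])
```

Here `probs[(i, j)]` estimates the probability of moving from state i to state j. Dropping the pooling over time and tabulating separately for each t would give the nonstationary estimates instead.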

These results generalize to more complex dependences. For example, a stochastic process
is a $k$th-order Markov chain if, for all $t$, the conditional distribution of $Y_t$, given $Y_0, \ldots, Y_{t-1}$,
is identical to the conditional distribution of $Y_t$ given $(Y_{t-1}, \ldots, Y_{t-k})$. Given the states at
the previous $k$ times, the future behavior of the chain is independent of past behavior before
those $k$ times. The Markov chain as defined above is first-order.


12.4.2  Example:  Changes  in  Evapotranspiration  Rates 

Fokianos and Kedem (2002, p. 39) analyzed a time series that consists of a series of 84
indicators about monthly changes in evapotranspiration (evaporation plus transpiration)
rates, compared with a year earlier, at a location in southern Israel. At a particular time $t$,
$y_t = 1$ if the change from a year ago is greater than the average, and $y_t = 0$ if it is less than
average. In order of the 84 months, they presented the data:


111111110001111100000001111000100000111111
110011000000011111100000000011111110000000

Table 12.7 Summary of Two-Step Transitions for Evapotranspiration Data

                          y_t
y_{t-1}    y_{t-2}      1      0
   1          1        26      7
   1          0         6      1
   0          1         0      8
   0          0         7     27

The logistic regression model with the autoregressive structure

$$\text{logit}[P(Y_t = 1 \mid y_{t-1}, y_{t-2}, \ldots, y_0)] = \alpha + \beta_1 y_{t-1}$$

is a first-order Markov chain with stationary transition probabilities. A second-order Markov
chain with stationary transition probabilities has

$$\text{logit}[P(Y_t = 1 \mid y_{t-1}, y_{t-2}, \ldots, y_0)] = \alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2}.$$

If the second-order model holds, it is sufficient to fit the model to the data as summarized
in Table 12.7, treating the data as four independent binomial variates. Using standard logistic
software, we get a deviance of 1.47 (df = 1), and estimates $(\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2) = (-1.439, 3.970,
-1.304)$ with SE values (0.427, 1.078, 1.078). This suggests that the simpler first-order
model may be adequate. Fitting the model with constraint $\beta_2 = 0$ increases the deviance by
1.98, with $(\hat{\alpha}, \hat{\beta}_1) = (-1.609, 2.996)$ having SE values (0.414, 0.572). Here, $\hat{\beta}_1$ necessarily
equals the log odds ratio for the collapsed 2 x 2 table relating $y_t$ and $y_{t-1}$. The interpretation
is that the estimated odds that $y_t = 1$ is exp(2.996) = 20 times as high when $y_{t-1} = 1$ as it is
when $y_{t-1} = 0$. For the first-order model, we can obtain an extra observation by considering
the pair at times 0 and 1; then, $\hat{\beta}_1$ changes to 3.027. Fitting the model permitting dependence
higher than second-order provides little improvement in the deviance.
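
The counts behind this example can be recovered directly from the 84 indicators. The sketch below (plain Python; the variable names are illustrative, not from the source) tabulates the two-step transitions and computes the first-order estimates in closed form, using the fact that for a single binary predictor the ML slope is the sample log odds ratio of the collapsed 2 x 2 table.

```python
import math
from collections import Counter

# The 84 monthly indicators, in time order (two lines of 42 joined).
series = (
    "111111110001111100000001111000100000111111"
    "110011000000011111100000000011111110000000"
)
y = [int(ch) for ch in series]

# Two-step transitions keyed by (y_{t-1}, y_{t-2}, y_t), for t = 2, ..., 83.
two_step = Counter((y[t - 1], y[t - 2], y[t]) for t in range(2, len(y)))

# One-step transitions (y_{t-1}, y_t) over the same time points.
one_step = Counter((y[t - 1], y[t]) for t in range(2, len(y)))

def log_odds_ratio(c):
    """Sample log odds ratio of a 2 x 2 table keyed by (previous, current)."""
    return math.log(c[(1, 1)] * c[(0, 0)] / (c[(1, 0)] * c[(0, 1)]))

# First-order model: intercept is the sample logit when y_{t-1} = 0.
alpha_hat = math.log(one_step[(0, 1)] / one_step[(0, 0)])
beta1_hat = log_odds_ratio(one_step)

# Including the pair at times 0 and 1 uses all 83 one-step transitions.
one_step_all = Counter(zip(y, y[1:]))
beta1_all = log_odds_ratio(one_step_all)

print(round(alpha_hat, 3), round(beta1_hat, 3), round(beta1_all, 3))
```

The second-order fit has no such closed form and would need iterative logistic regression software applied to the four binomials.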

12.4.3  Transitional  Models  with  Explanatory  Variables 

Transitional models can also include explanatory variables $x$. The joint mass function of $T$
sequential responses is then

$$f(y_1, \ldots, y_T; x) = f(y_1; x)\, f(y_2 \mid y_1; x)\, f(y_3 \mid y_1, y_2; x) \cdots f(y_T \mid y_1, y_2, \ldots, y_{T-1}; x).$$

More generally, $x$ may take a different value for each component, such as when covariates
are time dependent.

With binary $y$, we can specify a logistic regression model for each term in this
factorization. An example is the $k$th-order model

$$f(y_t \mid y_{t-k}, \ldots, y_{t-1}; x_t) = \frac{\exp[y_t(\alpha + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \beta^T x_t)]}{1 + \exp(\alpha + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \beta^T x_t)}, \qquad y_t = 0, 1,$$

which also treats $k$ previous responses as explanatory variables. It is called a regressive
logistic model (Bonney 1987). In the special case of first-order Markov structure, the
coefficients of $\{y_1, \ldots, y_{t-2}\}$ equal 0 in the model for $y_t$ (Azzalini 1994, Bonney 1987). It
may also help to allow interaction between $x_t$ and $y_{t-1}$ in their effects on $y_t$.

The interpretation and magnitude of $\beta$ depend on how many previous observations are
in the model. Normally we’d use the same number $k$ of previous observations in each term.
Within-cluster effects may diminish markedly by conditioning on previous responses. This
is an important difference from marginal models, for which the interpretation does not
depend on the specification of the dependence structure.

For a given subject, the product of the conditional mass functions over $t$ determines that
subject’s contribution to the likelihood function. It is common to ignore the contribution
of the marginal distribution for the first term. So, given the predictor, model-fitting treats
repeated transitions by a subject as independent. Thus, we can fit the model with ordinary
GLM software, treating each transition as a separate observation. Even when $x$ is also
treated as random, Fokianos and Kedem (2002) formed a likelihood function using only
this product of conditional mass functions for $\{y_t\}$. They termed it a partial likelihood,
showed theoretical properties, and argued that relatively little information is lost using it.
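
Treating each transition as a separate observation amounts to reshaping each subject's sequence into one row per time point, with the lagged responses as extra predictors. A minimal sketch of that reshaping follows; the record layout, the function name `transition_rows`, and the two example subjects are hypothetical, and the resulting rows would then be passed to whatever logistic regression routine is in use.

```python
def transition_rows(subjects, k=1):
    """Expand each subject's sequence into one row per transition.

    Each row holds the current response y_t, the k previous responses
    (most recent first), and the covariate value x_t. The first k
    observations of a subject serve only as lagged predictors."""
    rows = []
    for subj in subjects:
        y, x = subj["y"], subj["x"]
        for t in range(k, len(y)):
            lags = [y[t - lag] for lag in range(1, k + 1)]
            rows.append({"id": subj["id"], "t": t, "y": y[t], "lags": lags, "x": x[t]})
    return rows

# Hypothetical data: binary responses with a time-dependent covariate.
subjects = [
    {"id": 1, "y": [0, 1, 1, 0], "x": [0.2, 0.5, 0.1, 0.9]},
    {"id": 2, "y": [1, 1, 0, 0], "x": [1.0, 0.3, 0.4, 0.8]},
]
first_order = transition_rows(subjects, k=1)
second_order = transition_rows(subjects, k=2)
```

With k = 1 each subject with four observations contributes three rows; with k = 2 it contributes two, which is why the number of usable observations shrinks as the order of the model grows.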

More generally, Fahrmeir and Kaufmann (1987) and Fokianos and Kedem (2002,
2003) proposed a generalized linear model structure in which a link function applied
to $E(Y_t)$ is modeled as a linear function of past observations and random time-dependent
explanatory variables, without assuming Markov structure or stationarity. Moreover, $Y_t$
can be multivariate, such as a set of $I - 1$ indicators for an $I$-category multinomial
response. See Note 12.8 and Section 13.3.9 for other approaches for categorical time
series data.

12.4.4  Example:  Child’s  Respiratory  Illness  and  Maternal  Smoking 

Table 12.8 is from a Harvard study of air pollution and health. At ages 7 through 10,
children were evaluated annually on the presence of respiratory illness. A predictor is
maternal smoking at the start of the study, where $s = 1$ for smoking regularly and $s = 0$
otherwise.


Table 12.8 Child’s Respiratory Illness by Age and Maternal Smoking

                                 No Maternal Smoking     Maternal Smoking
 Child’s Respiratory Illness           Age 10                 Age 10
 Age 7     Age 8     Age 9          No       Yes           No       Yes
 No        No        No            237        10          118         6
 No        No        Yes            15         4            8         2
 No        Yes       No             16         2           11         1
 No        Yes       Yes             7         3            6         4
 Yes       No        No             24         3            7         3
 Yes       No        Yes             3         2            3         1
 Yes       Yes       No              6         2            4         2
 Yes       Yes       Yes             5        11            4         7

Source: Data courtesy of James Ware.


Let $y_t$ denote the response at age $t$ ($t$ = 7, 8, 9, 10). We use the first-order
regressive logistic model

$$\text{logit}[P(Y_t = 1)] = \alpha + \beta_1 y_{t-1} + \beta_2 s + \beta_3 t, \qquad t = 8, 9, 10.$$

Each subject contributes three observations to the model fitting. The data set
consists of 12 binomials, for the 2 x 2 x 3 combinations of $(y_{t-1}, s, t)$. For instance, for
the combination (0, 0, 8), $y_8 = 0$ for 237 + 10 + 15 + 4 = 266 subjects and $y_8 = 1$ for
16 + 2 + 7 + 3 = 28 subjects. The ML fit is

$$\text{logit}[\hat{P}(Y_t = 1)] = -0.293 + 2.211 y_{t-1} + 0.296 s - 0.243 t,$$

with effect SE values (0.158, 0.156, 0.095). Not surprisingly, the previous observation has
a strong effect. Given that and the child’s age, there is slight evidence of a positive effect of
maternal smoking: The likelihood-ratio statistic for $H_0$: $\beta_2 = 0$ is 3.55 (df = 1, P = 0.06).
The model itself does not show any evidence of lack of fit ($G^2$ = 3.12, df = 8).
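
To make the construction of the 12 binomials concrete, the following sketch rebuilds them from the 32 cell counts of Table 12.8. It is only an illustration of the bookkeeping: the dictionary layout and names are assumptions, with 0 coding "no" and 1 coding "yes" for illness and for maternal smoking.

```python
# Cell counts keyed by illness at ages (7, 8, 9); each value lists
# (s=0 & age-10 no, s=0 & age-10 yes, s=1 & age-10 no, s=1 & age-10 yes).
table = {
    (0, 0, 0): (237, 10, 118, 6),
    (0, 0, 1): (15, 4, 8, 2),
    (0, 1, 0): (16, 2, 11, 1),
    (0, 1, 1): (7, 3, 6, 4),
    (1, 0, 0): (24, 3, 7, 3),
    (1, 0, 1): (3, 2, 3, 1),
    (1, 1, 0): (6, 2, 4, 2),
    (1, 1, 1): (5, 11, 4, 7),
}

# binomials[(y_prev, s, t)] = [number with y_t = 0, number with y_t = 1]
binomials = {}
for (y7, y8, y9), cells in table.items():
    for s in (0, 1):
        for y10 in (0, 1):
            n = cells[2 * s + y10]
            path = {7: y7, 8: y8, 9: y9, 10: y10}
            # Every subject contributes one transition at each of t = 8, 9, 10.
            for t in (8, 9, 10):
                cell = binomials.setdefault((path[t - 1], s, t), [0, 0])
                cell[path[t]] += n

total_transitions = sum(sum(cell) for cell in binomials.values())
```

Each entry of `binomials` is one row of the grouped data set to which the logistic model is fitted, with `y_prev`, `s`, and `t` as predictors.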

12.4.5  Example:  Initial  Response  in  Matched  Pair  as  a  Covariate 

Consider  matched-pairs  data  in  which  the  observations  occur  at  different  times.  It  can  be 
more  relevant  to  model  the  follow-up  response  using  the  initial  response  as  a  covariate, 
rather  than  treating  the  two  variables  symmetrically  in  a  marginal  model. 

We illustrate with the insomnia study of Table 12.3 from Section 12.1.3. Let $Y_2$ denote
the follow-up ordinal response, for treatment $x$ with initial observation $y_1$. In the transitional
model

$$\text{logit}[P(Y_2 \le j)] = \alpha_j + \beta_1 x + \beta_2 y_1, \qquad (12.10)$$

$\beta_1$ compares the follow-up distributions for the treatments, adjusting for the initial
observation. This is an analog of an analysis-of-covariance model, with ordinal rather than
continuous response. This cumulative logit model refers to a univariate response ($Y_2$) rather
than marginal distributions of a multivariate response ($Y_1$, $Y_2$).

In this model, we use scores (10, 25, 45, 75) for the four categories of the initial time
to fall asleep $y_1$. Applying software for ordinary cumulative logit models to the univariate
response $Y_2$, the ML treatment effect estimate is $\hat{\beta}_1 = 0.885$ (SE = 0.246). This provides
strong evidence that follow-up time to fall asleep is lower for the active drug group. For
any given value for the initial response, the estimated odds of falling asleep by a particular
time for the active treatment are exp(0.885) = 2.42 times those for the placebo group.
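
The odds ratio follows by exponentiating the estimate. As a supplementary sketch, the same arithmetic also gives a Wald interval for the odds ratio; the interval is not reported in the text, and the 95% level with normal quantile 1.96 is an assumption made here for illustration.

```python
import math

beta_hat, se = 0.885, 0.246  # treatment estimate and SE reported in the text
z = 1.96  # approximate 0.975 standard normal quantile (assumed 95% level)

odds_ratio = math.exp(beta_hat)
# Wald interval on the log odds scale, then exponentiated.
ci_lower = math.exp(beta_hat - z * se)
ci_upper = math.exp(beta_hat + z * se)

print(f"odds ratio {odds_ratio:.2f}, Wald interval ({ci_lower:.2f}, {ci_upper:.2f})")
```

An interval that excludes 1 agrees with the conclusion of a treatment effect; a likelihood-ratio interval could differ somewhat from this Wald version.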

For matched-pairs data, marginal models evaluate how the marginal distributions of
$Y_1$ and $Y_2$ depend on explanatory variables. By contrast, a transitional model treats $Y_2$ as
a univariate response, evaluating effects of explanatory variables while adjusting for the
initial response $y_1$. In some situations, whether an effect of a certain type exists may differ
between these two types of model. For example, suppose the true marginal distributions
for initial response are identical for the treatment groups, as we expect with random
assignment of subjects to the groups. Suppose also that there is no treatment effect, in
the sense that conditional on the initial response, the follow-up response distribution is
identical for the treatment groups. Then, the follow-up marginal distributions are also
identical. By contrast, suppose the initial marginal distributions are not identical, as might
well happen with observational data for which randomization of subjects is not possible.
Then, even when the conditional distributions for follow-up response are identical for the
two treatment groups, the difference between follow-up and initial marginal distributions
may differ between the treatment groups. In such cases, it is more informative to use the
transitional model.


12.4.6  Transitional  Models  and  Loglinear  Conditional  Models 

In this chapter we have mainly focused on marginal models. Transitional models, by
contrast, are conditional, with the effects on $y_t$ being described conditionally on previously
observed responses.

More generally, we could ignore the ordering on $t$ and consider models in which $y_t$ is
modeled in terms of $y_u$ with both $u < t$ and $u > t$. This is essentially what we do
with standard loglinear models, in which we model effects, associations, and interactions
conditional on all the other response variables. Although such models are of use for
describing joint distributions of several response variables, as we did in Chapters 9 and 10,
they are usually of less relevance when we are analyzing effects of explanatory variables
$x$. Normally, in considering the effect of an explanatory variable on a response $y_t$, it is not
relevant to describe that effect conditional on all other responses.


NOTES 

Section 12.1: Marginal Modeling: Maximum Likelihood Approach

12.1 Marginal ML: For other work on ML fitting of marginal models, see Bergsma and Rudas
(2002), Bergsma et al. (2009), Colombi (1998), Drton and Richardson (2008), Ekholm et al.
(2000), Fitzmaurice and Laird (1993), Lang (2004, 2005), Lang et al. (1999), and Molenberghs
and Verbeke (2005, Chap. 6, 7), with the last reference also modeling global odds ratios
for ordinal responses. Ekholm et al. (2000) modeled association factors (Sec. 2.4.2, which
they referred to as dependence ratios) and corresponding higher-order measures that they
used together with the marginal probabilities to parameterize a multivariate binary response.
Ashford and Sowden (1970), Lesaffre and Molenberghs (1991), and Ochi and Prentice (1984)
presented multivariate probit models that have probit models for the margins.


Section  12.2:  Marginal  Modeling:  Generalized  Estimating  Equations  (GEE)  Approach 

12.2 GEE: Fitzmaurice et al. (1993), Liang et al. (1992), Molenberghs and Verbeke (2005, Chap.
8-10), and Sutradhar (2003) discussed GEE methods for categorical (primarily binary)
responses. For multinomial responses, see Heagerty and Zeger (1996), Lipsitz et al. (1994),
Miller et al. (1993), Parsons et al. (2009), Touloumis (2011), and references in Agresti and
Natarajan (2001). More general models with ordinal responses allow for dispersion
parameters that also depend on covariates (Toledano and Gatsonis 1996). LaVange et al. (2001) used
GEE methods to account for clustered sampling in surveys and clinical trials.

12.3 WLS: Koch et al. (1977) used weighted least squares (WLS) to fit marginal models to
Table 12.1. WLS for categorical modeling is described in Section 16.7.1. It has severe limitations
(e.g., covariates must be categorical and marginal tables cannot be sparse) but led naturally to
the GEE approach (Miller et al. 1993).

Section 12.3: Quasi-likelihood and Its GEE Multivariate Extension: Details

12.4 Quasi-likelihood and model misspecification: Firth (1993b) gave an overview of
quasi-likelihood methods. Besides McCullagh (1983), Heyde (1997) and Liang and Zeger (1995)
discussed unbiased estimating functions and their connections with asymptotic consistency
and efficiency. Godambe showed in 1960 that ML estimators are optimal solutions with an
unbiased estimating function. When quasi-likelihood estimators are not ML, Cox (1983) and
Firth (1987) suggested that they still retain good efficiency when the departure from the
natural exponential family is at most moderate, such as modest overdispersion relative to
such a family. The GEE methods proposed by Liang and Zeger (1986, 1995) also built on
related theory in the econometrics literature about model misspecification. See Gourieroux
et al. (1984), Hansen (1982), and White (1982).

12.5 GEE/ML/GEE2: The generalized estimating equations are likelihood equations, and hence
the GEE estimates are also ML, in certain cases. Examples are multivariate normal data or
binary data when the working covariance is correct (Fitzmaurice et al. 1993). A GEE2 analysis
adds estimating equations for the correlation structure (Prentice and Zhao 1991). This has
the potential to increase efficiency. A disadvantage is that, unlike with ordinary GEE, $\hat{\beta}$ is no
longer consistent if this part of the model is misspecified.

12.6 Missing data: Surveys of ways to handle missing data include Fleiss et al. (2003, Chap.
16), Little (2005), Little and Rubin (2002), and Molenberghs and Verbeke (2005, Chap. 26-32).
See also Altham (2010), Baker and Laird (1988), Fitzmaurice et al. (1994), Forster and
Smith (1998), Ibrahim et al. (2005), Molenberghs and Goetghebeur (1997), Park and Brown
(1994), and Rubin (1996). Stokes et al. (2012) showed how to build the missingness pattern
into a model to check whether it is associated with the response or interacts with effects of
explanatory variables.


Section  12.4:  Transitional  Models:  Markov  Chain  and  Time  Series  Models 

12.7 Markov chains: For statistical inference with Markov chains, see Andersen (1980, Sec.
7.7), Anderson and Goodman (1957), Billingsley (1961), Bishop et al. (1975, Chap. 7),
and Kalbfleisch and Lawless (1985). Conaway (1989), Hoeting et al. (2000), Stiratelli et al.
(1984), and Ware et al. (1988) proposed other analyses focusing on the conditional dependence
structure.

12.8 Time series: For time series modeling of a categorical response, see Azzalini (1994), Bonney
(1987), Cox (1970), Fahrmeir and Kaufmann (1987), Fokianos and Kedem (2002, 2003),
Heagerty (2002), Kalbfleisch and Lawless (1985), Klingenberg (2008), Liang and Zeger
(1989), Muenz and Rubinstein (1985), Staffer et al. (1993), Varin and Vidoni (2006), Zeger
and Qaqish (1988), Zhao and Prentice (1990), and the many references in Fokianos and Kedem
(2003) and Klingenberg (2008). Transitional models can also incorporate latent variables (e.g.,
Lin et al. 2008), as discussed in Section 14.1.5.


EXERCISES 

Applications 

12.1 For the attitudes about abortion data of Table 11.13 in Section 11.7.4, consider the
model

$$\text{logit}[P(Y_t = 1)] = \alpha + \beta_1 I(t = 1) + \beta_2 I(t = 2),$$

where the indicators refer to (1) low income and (2) unmarried, the effects
contrasting each with endangered woman’s health.

a. Find the ML estimates of $\beta_1$ and $\beta_2$ and their SE values by treating the three
observations for a subject as independent.

b. Find the GEE estimates and empirical SE values based on working correlation
structures (i) exchangeable, (ii) independence. Why are the SE values smaller
with the GEE analyses than in (a)? When would you expect them to be larger?

c. Find the ML estimates of $\beta_1$ and $\beta_2$ and their SE values by treating the three
observations for a subject as dependent. Compare to results in (a).

12.2 For Table 10.1, fit a marginal model by ML or GEE to describe main effects of race,
gender,  and  substance  type  (marijuana,  alcohol,  cigarettes)  on  whether  a  subject 
had  used  that  substance.  Summarize  effects. 

12.3 Refer to Exercise 12.2. Further study shows evidence of an interaction between
gender and substance type. Using GEE with exchangeable working correlation, the
model fit for the probability $\pi$ of using a particular substance is

$$\text{logit}(\hat{\pi}) = -0.57 + 0.38r - 0.20g + 1.93s_1 + 0.86s_2 + 0.37g \times s_1 + 0.22g \times s_2,$$

where $r, g, s_1, s_2$ are indicator variables for race (1 = white), gender (1 = female),
and substance type ($s_1 = 1, s_2 = 0$ for alcohol; $s_1 = 0, s_2 = 1$ for cigarettes; $s_1 =
s_2 = 0$ for marijuana). Show that:

a. The estimated odds that a nonwhite male has used marijuana are 0.57.

b. Given gender, the estimated odds that a white subject used a given substance
are 1.46 times the estimated odds for a black subject.

c. Given race, the estimated odds that a female has used alcohol are 1.19 times the
estimated odds for males; for cigarettes and for marijuana, the estimated odds
ratios are 1.02 and 0.82.

d.  Given  race,  the  estimated  odds  that  a  female  has  used  alcohol  (cigarettes)  are 
9.97  (2.94)  times  the  estimated  odds  she  has  used  marijuana,  and  the  estimated 
odds  that  a  male  has  used  alcohol  (cigarettes)  are  6.89  (2.36)  times  the  estimated 
odds  he  has  used  marijuana.  Interpret  the  interaction. 

12.4  For  Table  12.1  from  the  depression  study,  fit  by  ML  or  GEE  the  marginal  logistic 
model  allowing  treatment  x  time  interaction,  using  the  time  scores  (1,  2,  4)  for 
the  week  number.  Interpret  estimates.  Compare  substantive  results  to  those  in 
Section  12.1.1  for  scores  (0,  1,2). 

12.5  Analyze  Table  12.8  using  a  marginal  logistic  model  with  age  and  maternal  smoking 
as  predictors.  Compare  interpretations  to  the  Markov  model  of  Section  12.4.4. 

12.6 Table 12.9 refers to a three-period crossover trial to compare placebo (treatment A)
with  a  low-dose  analgesic  (treatment  B)  and  high-dose  analgesic  (treatment  C)  for 
relief  of  primary  dysmenorrhea.  Subjects  in  the  study  were  divided  randomly  into 
six  groups,  the  possible  sequences  for  administering  the  treatments.  At  the  end  of 

Table 12.9 Crossover Trial Data for Exercise 12.6

Treatment        Response Pattern for Treatments (A, B, C)
Sequence     000    001    010    011    100    101    110    111
ABC            0      2      2      9      0      0      1      1
ACB            2      0      0      9      1      0      0      4
BAC            0      1      1      8      1      3      0      1
BCA            0      1      1      8      1      0      0      1
CAB            3      0      0      7      0      1      2      1
CBA            1      5      0      4      0      3      1      0

Source: Jones and Kenward (1987).


each period, each subject rated the treatment as giving no relief (0) or some relief
(1). Let $y_{i(k)t} = 1$ denote relief for subject $i$ nested in treatment sequence $k$, using
treatment $t$ ($t$ = A, B, C). Assuming common treatment effects for each sequence,
and setting $\beta_A = 0$, obtain and interpret $\{\hat{\beta}_t\}$ (using ML or GEE) for the model

$$\text{logit}[P(Y_{i(k)t} = 1)] = \alpha_k + \beta_t.$$

How would you order the drugs, taking significance into account?

12.7  Table  12.10  is  from  a  Kansas  State  University  survey  of  262  pig  farmers.  For  the 
question  “What  are  your  primary  sources  of  veterinary  information?,”  the  categories 


Table 12.10 Veterinary Information Data for Exercise 12.7

               |            A = yes            |            A = no             |
               |    B = yes    |    B = no     |    B = yes    |    B = no     |
               | C=yes | C=no  | C=yes | C=no  | C=yes | C=no  | C=yes | C=no  |
Response on D  |  Y  N |  Y  N |  Y  N |  Y  N |  Y  N |  Y  N |  Y  N |  Y  N |
Educ  Pigs  E  |
No    < 1   Y  |  1  0 |  0  0 |  0  0 |  0  0 |  2  1 |  1  2 |  1  1 |  5  3 |
            N  |  0  0 |  0  0 |  0  0 |  0  1 |  1  0 |  0  5 |  4  7 |  7  0 |
      1-2   Y  |  2  0 |  0  0 |  0  0 |  0  0 |  4  0 |  0  4 |  1  0 |  0  4 |
            N  |  0  0 |  0  0 |  0  0 |  0  0 |  0  0 |  0  5 |  0  3 |  4  0 |
      2-5   Y  |  3  0 |  0  0 |  0  0 |  0  0 |  3  0 |  0  1 |  2  0 |  1  1 |
            N  |  1  0 |  0  0 |  0  0 |  0  3 |  0  0 |  0  2 |  0  1 |  4  0 |
      > 5   Y  |  2  0 |  0  0 |  0  0 |  0  0 |  1  0 |  1  0 |  0  1 |  0  2 |
            N  |  1  0 |  0  2 |  1  0 |  1  6 |  0  1 |  1  1 |  0  0 |  6  0 |
Some  < 1   Y  |  3  0 |  0  0 |  0  0 |  0  0 |  4  0 |  1  1 |  0  0 |  2 11 |
            N  |  0  0 |  0  0 |  0  0 |  0  0 |  4  0 |  1  2 |  4  6 | 14  0 |
      1-2   Y  |  0  0 |  0  0 |  0  0 |  0  0 |  2  0 |  0  1 |  0  0 |  1  6 |
            N  |  0  0 |  0  0 |  1  0 |  0  1 |  2  1 |  0  4 |  2  7 | 14  0 |
      2-5   Y  |  0  0 |  0  0 |  0  0 |  0  0 |  1  0 |  0  0 |  0  1 |  1  3 |
            N  |  1  0 |  0  0 |  0  0 |  0  0 |  0  0 |  0  5 |  0  4 |  4  0 |
      > 5   Y  |  1  0 |  0  0 |  0  0 |  0  0 |  0  0 |  1  1 |  0  0 |  0  2 |
            N  |  1  1 |  0  0 |  0  1 |  0 10 |  0  0 |  0  4 |  1  2 |  4  0 |

Source: Data courtesy of Tom Loughin.

were (A) professional consultant, (B) veterinarian, (C) state or local extension
service, (D) magazines, and (E) feed companies and reps. Farmers sampled were
asked to select all relevant categories. The $2^5 \times 2 \times 4$ table shows the (yes, no)
counts for each of these five sources cross-classified with the farmers’ education
(whether they had at least some college education) and size of farm (number of
pigs marketed annually, in thousands).

a. Explain why it is not proper to analyze the data by fitting a multinomial model
to the counts in the 2 x 4 x 5 contingency table cross-classifying education
by size of farm by the source of veterinary information, treating source as the
response variable. (This table contains 453 positive responses of sources from
the 262 farmers.)

b. For a farmer with education $i$ and size of farm $s$, let $\pi_{j(is)}$ denote the probability
of responding “yes” on source $j$. Table 12.11 shows output for using GEE with
exchangeable working correlation to estimate parameters in the model lacking
an education effect,

$$\text{logit}[\pi_{j(is)}] = \alpha_j + \beta_j s, \qquad s = 1, 2, 3, 4.$$

Explain how to interpret the working correlation matrix. Explain why the results
suggest a strong positive size of farm effect for source A and perhaps a weak
negative size effect of similar magnitude for C, D, and E.

c. Constraining $\beta_3 = \beta_4 = \beta_5$, the ML estimate of the common slope is
-0.184 (SE = 0.063). Explain why it is advantageous to fit the marginal model

Table 12.11 Output for Veterinary Data of Exercise 12.7

               Working Correlation Matrix
          Col1      Col2      Col3      Col4      Col5
Row1    1.0000    0.0997    0.0997    0.0997    0.0997
Row2    0.0997    1.0000    0.0997    0.0997    0.0997
Row3    0.0997    0.0997    1.0000    0.0997    0.0997
Row4    0.0997    0.0997    0.0997    1.0000    0.0997
Row5    0.0997    0.0997    0.0997    0.0997    1.0000

          Analysis Of GEE Parameter Estimates
          Empirical Standard Error Estimates
Parameter          Estimate   Std Error        Z    Pr > |Z|
source 1            -4.4994      0.6457    -6.97      <.0001
source 2            -0.8279      0.2809    -2.95      0.0032
source 3            -0.1526      0.2744    -0.56      0.5780
source 4             0.4875      0.2698     1.81      0.0708
source 5            -0.0808      0.2738    -0.30      0.7680
size*source 1        1.0812      0.1979     5.46      <.0001
size*source 2        0.0792      0.1105     0.72      0.4738
size*source 3       -0.1894      0.1121    -1.69      0.0912
size*source 4       -0.2206      0.1081    -2.04      0.0412
size*source 5       -0.2387      0.1126    -2.12      0.0341
simultaneously for all sources rather than separately to each. [Agresti and Liu
(1999) and Loughin and Scherer (1998) discussed analyses for data of this form.]

12.8 Refer to Table 13.3 in Section 13.3.2 on attitudes toward legalized abortion. For
the response $Y_t$ (1 = support legalization, 0 = oppose) for question $t$ ($t = 1, 2, 3$)
and for gender $g$ (1 = female, 0 = male), consider the model
$\mathrm{logit}[P(Y_t = 1)] = \alpha + \gamma g + \beta_t$ with $\beta_3 = 0$.

a. A GEE analysis using unstructured working correlation gives correlation estimates
0.826 for questions 1 and 2, 0.797 for 1 and 3, and 0.832 for 2 and 3.
What does this suggest about a reasonable working correlation structure? Why?

b.  Table  12.12  shows  a  GEE  analysis  with  exchangeable  working  correlation. 
Interpret  effects. 

c. Treating the three responses for each subject as independent observations and
performing ordinary logistic regression, $\hat{\beta}_1 = 0.149$ (SE = 0.066), $\hat{\beta}_2 = 0.052$
(SE = 0.066), and $\hat{\gamma} = 0.004$ (SE = 0.054). Give a heuristic explanation of
why within-subject standard errors are much larger than with GEE, yet the
between-subject standard error is smaller. (See also Exercise 12.21.)

Table 12.12 Output for Exercise 12.8 on Abortion Attitudes

      Working Correlation Matrix

          Col1     Col2     Col3
Row1    1.0000   0.8173   0.8173
Row2    0.8173   1.0000   0.8173
Row3    0.8173   0.8173   1.0000

      Analysis Of GEE Parameter Estimates
      Empirical Standard Error Estimates

Parameter     Estimate   Std Error       Z   Pr > |Z|
Intercept      -0.1253      0.0676   -1.85     0.0637
question 1      0.1493      0.0297    5.02     <.0001
question 2      0.0520      0.0270    1.92     0.0544
question 3      0.0000      0.0000
female          0.0034      0.0878    0.04     0.9688

12.9  For  the  air  pollution  data  in  Table  12.13,  using  ML  or  GEE,  fit  marginal  logistic 
models  that  assume  (a)  marginal  homogeneity,  (b)  a  linear  effect  of  time,  and 
(c)  no  pattern.  Interpret  and  compare. 

12.10 Use GEE methods to analyze the clinical trials data in Table 13.7, treating
observations within each center as a correlated cluster.

12.11 For Table 11.6, use GEE methods with cumulative logits to compare the two
marginal distributions. Compare results to those using ML in Section 11.3.4.

12.12  For  the  Presidential  voting  summarized  in  Table  11.1,  suppose  100  men  in  the 
sample  voted  in  the  2004  election  but  not  in  the  2008  election.  Of  them,  in  2004, 
50  voted  Democrat  and  50  voted  Republican.  Show  how  you  can  use  all  533 
Table 12.13 Results of Breath Test at Four Ages

Y9   Y10   Y11   Y12   Count      Y9   Y10   Y11   Y12   Count
 1     1     1     1      94       0     1     1     1      19
 1     1     1     0      30       0     1     1     0      15
 1     1     0     1      15       0     1     0     1      10
 1     1     0     0      28       0     1     0     0      44
 1     0     1     1      14       0     0     1     1      17
 1     0     1     0       9       0     0     1     0      42
 1     0     0     1      12       0     0     0     1      35
 1     0     0     0      63       0     0     0     0     572

Source: Ware et al. (1988).


observations to estimate the distribution of $Y_1$, the 433 complete observations to
estimate the distribution of $(Y_2 \mid Y_1)$, and use these results to estimate the joint
distribution. What assumption does this analysis make? Explain why the implied
estimate of the odds ratio is the same as using only the complete observations.

12.13 Refer to transitional models for the insomnia study of Table 12.3.

a.  To  compare  effects  while  adjusting  for  the  initial  response,  fit  model  (12.10), 
using  scores  {10,  25,  45,  75}  for  time  to  falling  asleep.  Also  fit  the  interaction 
model,  and  describe  the  lack  of  fit.  (Note  that  for  the  first  two  baseline  levels,  the 
active  and  placebo  treatments  have  similar  sample  response  distributions  at  the 
follow-up;  at  higher  baseline  levels,  the  active  treatment  seems  more  successful.) 

b.  Fit  the  interaction  model 


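$\mathrm{logit}[P(Y_2 \le j)] = \alpha_j + \beta_1 x + \beta_2 y_1 + \beta_3 (x \times y_1)$

that constrains effects $\{\beta_1 x + \beta_2 y_1 + \beta_3 x y_1\}$ to follow the pattern
$(\tau, \tau, \lambda + \sigma, \lambda)$ for the active group and $(\tau, \tau, \sigma, 0)$ for the placebo group. Interpret $\lambda$.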


12.14  Table  12.13  refers  to  a  longitudinal  study  at  Harvard  of  effects  of  air  pollution 
on  respiratory  illness  in  children.  The  children  were  examined  annually  at  ages  9 
through  12  and  classified  according  to  the  presence  or  absence  of  wheeze.  Denote 
the binary response (1 = wheeze, 0 = no wheeze) by $Y_t$ at age $t$, $t = 9, 10, 11, 12$.

a. Explain why the loglinear model $(Y_9 Y_{10}, Y_{10} Y_{11}, Y_{11} Y_{12})$ represents a first-order
Markov chain. Show that it has $G^2 = 122.9$ (df = 8).

b. Explain why the model $(Y_9 Y_{10} Y_{11}, Y_{10} Y_{11} Y_{12})$ represents a second-order Markov
chain and satisfies conditional independence at ages 9 and 12, given states at
ages 10 and 11. Show it has $G^2 = 23.9$ (df = 4).

c. Show that the loglinear model $(Y_9 Y_{10}, Y_9 Y_{11}, Y_9 Y_{12}, Y_{10} Y_{11}, Y_{10} Y_{12}, Y_{11} Y_{12})$ has
$G^2 = 1.5$ (df = 5). Show that the association seems similar for pairs of ages
1 year apart, and somewhat weaker for pairs of ages more than 1 year apart.
Show that the simpler model in which

$\lambda_{ij}^{Y_9 Y_{10}} = \lambda_{ij}^{Y_{10} Y_{11}} = \lambda_{ij}^{Y_{11} Y_{12}}$

and

$\lambda_{ij}^{Y_9 Y_{11}} = \lambda_{ij}^{Y_9 Y_{12}} = \lambda_{ij}^{Y_{10} Y_{12}}$

fits well, with $G^2 = 2.3$ (df = 9) and estimated log odds ratios of 1.75 in the
first case and 1.04 in the second.


12.15 Refer to Table 12.8 on respiratory illness and maternal smoking.

a. Does a transitional model with two previous responses fit better than the
first-order model of Section 12.4.4? Interpret.

b.  Combine  the  data  for  the  two  levels  of  maternal  smoking.  Does  a  first-order 
Markov  chain  model  these  data  adequately?  Find  a  loglinear  model  that  fits 
adequately. 

12.16  Analyze  the  depression  data  of  Table  12.1  using  a  first-order  transitional  model. 
Compare  interpretations  to  those  using  marginal  models. 

12.17 Table 12.14 is from a longitudinal study of coronary risk factors in schoolchildren
(Woolson and Clarke 1984). A sample of children aged 11-13 in 1977 were classified
by gender and by relative weight (obese, not obese) in 1977, 1979, and 1981.
Analyze these data.


Table 12.14 Data for Exercise 12.17 on Weight Trajectories

                         Responses (a)
Gender    NNN   NNO   NON   NOO   ONN   ONO   OON   OOO
Male      119     7     8     3    13     4    11    16
Female    129     8     7     9     6     2     7    14

(a) NNN indicates not obese in 1977, 1979, and 1981; NNO indicates not obese in 1977 and 1979 but obese in 1981;
and so on.

Source: Reproduced with permission from the Royal Statistical Society, London (Woolson and Clarke 1984).


12.18  Analyze  the  pig  farmer  data  of  Exercise  12.7  using  marginal  models  with  all  the 
variables. 

12.19  Lesaffre  and  Spiessens  (2001)  analyzed  data  from  a  dermatology  clinical  trial  for 
toenail  infection.  Analyze  the  data  for  the  binary  endpoint,  which  are  available  at 
www.blackwellpublishing.com/rss, using methods of this chapter. Prepare a
two-page  report  summarizing  your  analyses,  and  include  an  appendix  with  relevant 
software  output. 

12.20  Results  of  the  annual  boat  race  between  crews  from  Oxford  and  Cambridge  are 
shown  at  www.theboatrace.org.  Consider  the  time  series  consisting  of  the 
sequence of winners, excluding the dead heat in 1877. Analyze these data. See
Section 13.3.10 for an analysis that accounts for weight differences between the
crews.
Theory and Methods

12.21 Recall that positive correlation results in reduced SE values for within-cluster
estimated effects. How about between-cluster effects? Suppose $y_{11}, \ldots, y_{1T}$ are
Bernoulli trials with $E(y_{1j}) = \pi_1$, and suppose $y_{21}, \ldots, y_{2T}$ are Bernoulli trials with
$E(y_{2j}) = \pi_2$. Suppose that for $i = 1, 2$, $\mathrm{corr}(y_{ij}, y_{ik}) = \rho$ for all $j \ne k$, and
$\mathrm{corr}(y_{1j}, y_{2k}) = 0$. Find $SE(\hat{\pi}_1 - \hat{\pi}_2)$, and show that it is larger when $\rho > 0$ than when
observations within the two samples are independent.

12.22 In the example in Section 12.3.3 of a common mean model for count data,
the model-based variance estimate is $\bar{y}/n$, whereas the sandwich estimator is
$\sum_i (y_i - \bar{y})^2 / n^2$. Which would you expect to be better (a) if the Poisson model
holds, and (b) if there is severe overdispersion? Why?

12.23 In the example in Section 12.3.3 of a common mean model for count data, suppose
we assume that $v(\mu_i) = \sigma^2$ when actually $\mathrm{var}(Y_i) = \mu_i$. Find the model-based
asymptotic variance, the actual asymptotic variance, and the sandwich estimator of
the actual variance.

12.24 Consider the model of marginal homogeneity for matched-pairs binary data,
expressed in form $P(Y_t = 1) = \beta$, $t = 1, 2$. Show how the GEE expressions for the
working covariance matrix (12.6), the estimating equations (12.7), and variance
$V_G$ of $\sqrt{n}\,\hat{\beta}$ simplify, assuming working correlation structure (a) independence,
(b) exchangeable (i.e., allowing correlation, as there is only one pair).

12.25  Show  that  (12.4)  is  equivalent  to  the  formula  for  the  large-sample  covariance  of  the 
ML  estimator  in  a  GLM,  estimated  by  (4.31). 

12.26  a.  Explain  the  sense  in  which  GEE  methodology  is  a  multivariate  version  of  QL. 

b.  Summarize  the  advantages  and  disadvantages  of  the  QL  approach. 

c.  Describe  conditions  under  which  GEE  parameter  estimators  are  consistent  and 
conditions  under  which  they  are  not.  For  conditions  in  which  they  are  consistent, 
explain  why. 

12.27  Refer  to  the  interpretation  at  the  end  of  Section  12.1.3  based  on  shifts  in  the  mean  for 
the  insomnia  study.  State  a  normal  latent  variable  model,  and  show  how  to  generate 
a  similar  result  by  comparing  estimates  of  the  number  of  standard  deviation  shift 
between  initial  and  follow-up  responses  for  each  group. 

12.28  For  the  analysis  of  the  insomnia  data  at  the  end  of  Section  12.1.3,  explain  how 
to  calculate  the  SE  for  the  difference  between  the  difference  of  means  reported 
there.  (Note  that  one  difference  uses  paired  samples  and  the  other  uses  independent 
samples.) 

12.29 For a poll of a random sample of 1800 voting-age British citizens, followed up
six months later, 794 indicated approval of the Prime Minister’s performance in
office each time, 570 indicated disapproval each time, 150 indicated approval
initially and disapproval later, 86 indicated disapproval initially and approval
later, and 200 responded at the first survey (100 approving and 100 disapproving)
but not at the second survey. Denote the response by $X$ at the first survey and
by $Y$ at the second survey. Let $\Delta_1 = P(Y = 1 \mid X = 1) - P(Y = 1 \mid X = 2)$ and
$\Delta_2 = P(X = 1 \mid Y = 1) - P(X = 1 \mid Y = 2)$.

a.  Explain  why  the  missing  observations  do  not  appear  to  be  MCAR.  [Hint:  Would 
you  expect  to  see  this  distribution  of  missingness  if  the  missing  observations 
were  a  random  sample  of  all  the  observations?] 

b.  Explain  intuitively  why  there  is  insufficient  information  to  determine  whether 
the  missing  observations  are  MAR. 

c. Under the MAR assumption, explain intuitively why you can use the fully
observed data to estimate $\Delta_1$ and the odds ratio but not $\Delta_2$.

d. Under the MAR assumption, use the information about the distribution of
$(Y \mid X)$ from the fully observed data to predict how the missing data contribute
to the cell counts. Use them together with the observed counts to estimate $\Delta_2$.
(For details, see Fleiss et al. 2003, Sec. 16.2.)

12.30 What is wrong with this statement?: “For a first-order Markov chain, $Y_t$ is
independent of $Y_{t-2}$.”

12.31 Gamblers A and B have a total of $I$ dollars. They play games of pool repeatedly. In
each game they each bet $1, and the winner takes the other’s dollar. The outcomes
of the games are statistically independent, and A has probability $\pi$ and B has
probability $1 - \pi$ of winning any game. Play stops when one player has all the
money. Let $Y_t$ denote A’s monetary total after $t$ games.

a. Show that $\{Y_t\}$ is a first-order Markov chain.

b. State the transition probability matrix. (For this gambler's ruin problem, 0 and $I$
are absorbing states. Eventually, the chain enters one of these. The other states
are transient.)

12.32 For a first-order Markov chain, let $X$, $Y$, and $Z$ denote the classifications for the
$I \times I \times T$ table consisting of $\{n_{ij}(t),\ i = 1, \ldots, I,\ j = 1, \ldots, I,\ t = 1, \ldots, T\}$.

a. Explain why all transition probabilities are stationary if expected frequencies for
this table satisfy loglinear model $(XY, XZ)$. [Thus, the likelihood-ratio statistic
for testing stationary transition probabilities equals $G^2$ for testing fit of model
$(XY, XZ)$.]

b. For a Markov chain with stationary transition probabilities, let $n_{ijk}$ denote the
number of transitions from $i$ to $j$ to $k$ over two successive steps. For $\{n_{ijk}\}$, argue
that the goodness of fit of loglinear model $(Y_1 Y_2, Y_2 Y_3)$ tests that the chain is
first-order against the alternative that it is second-order (Anderson and Goodman
1957).


CHAPTER  13 


Clustered  Categorical  Data: 
Random  Effects  Models 


In  Chapter  12  we  dealt  with  observations  that  occur  in  clusters,  such  as  sets  of  repeated 
observations  over  time  for  subjects  in  a  longitudinal  study.  Observations  within  clusters 
are  usually  positively  correlated,  tending  to  be  more  alike  than  observations  from  different 
clusters.  Ordinary  analyses  that  ignore  the  correlation  and  treat  within-cluster  observations 
the  same  as  between-cluster  observations  produce  invalid  standard  errors,  tending  to  be  too 
small  for  between-cluster  effect  estimates  and  too  large  for  within-cluster  effects. 

In Chapter 12 we modeled the marginal distributions of the clustered responses, treating
the joint dependence structure as a nuisance. In this chapter we present an alternative
approach that adds cluster-level terms to the model that take the same value for each
observation in a cluster. They are unobserved and, when treated as varying randomly
among clusters, are called random effects. We introduced this approach in Section 11.2.4
with  a  model  for  matched  pairs.  The  models  have  effects  that  pertain  at  the  cluster  level.  We 
refer  to  such  effects  as  cluster-specific,  or  subject-specific  when  each  cluster  is  a  subject. 
By  contrast,  in  marginal  models  effects  have  population-averaged  interpretations. 

In  Section  13.1  we  extend  the  generalized  linear  model  to  include  random  effects,  giving 
a  generalized  linear  mixed  model.  In  Section  13.2  we  present  the  most  important  special 
case  for  binary  data,  the  logistic-normal  model,  which  uses  the  logit  link  and  assumes  a 
normal  distribution  for  the  random  effects.  We  show  several  examples  in  Section  13.3. 
Section  13.4  extends  this  model  to  multinomial  responses,  and  Section  13.5  introduces 
models  with  a  hierarchical,  “multilevel”  structure.  In  Section  13.6  we  discuss  model  fitting 
and  in  Section  13.7  the  Bayesian  approach  to  modeling  multivariate  categorical  data. 


13.1  RANDOM  EFFECTS  MODELING  OF  CLUSTERED 
CATEGORICAL  DATA 

Parameters  that  describe  a  factor’s  effects  in  an  ordinary  generalized  linear  model 
(GLM)  are  called  fixed  effects.  They  apply  to  all  categories  of  interest,  such  as  genders. 
treatments, or age groupings. By contrast, random effects usually apply to a sample. For
a  study  that  makes  repeated  observations  on  a  sample  of  subjects,  for  example,  the  model 
treats  observations  from  a  given  subject  as  a  cluster,  and  it  has  a  random  effect  for  each 
subject. 

GLMs  extend  ordinary  regression  by  allowing  nonnormal  responses  and  a  link  function 
of  the  mean.  The  generalized  linear  mixed  model  (GLMM)  is  a  further  extension  that 
permits  random  effects  as  well  as  fixed  effects  in  the  linear  predictor. 


13.1.1  Generalized  Linear  Mixed  Model 

Let $y_{it}$ denote observation $t$ in cluster $i$, $t = 1, \ldots, T_i$. As in marginal models, the number
of observations may vary by cluster. In a longitudinal study, even if clusters have equal size,
many of them may have missing observations. Let $\mathbf{x}_{it}$ denote a column vector of values of
explanatory variables for this observation.

Let $\mathbf{u}_i$ denote the vector of random effects for cluster $i$. Often, the random effect is
univariate. Conditional on $\mathbf{u}_i$, a GLMM resembles an ordinary GLM. Let $\mu_{it} = E(Y_{it} \mid \mathbf{u}_i)$.
The linear predictor for a GLMM has the form

$g(\mu_{it}) = \mathbf{x}_{it}^T \boldsymbol{\beta} + \mathbf{z}_{it}^T \mathbf{u}_i \qquad (13.1)$

for link function $g(\cdot)$ and fixed effect model parameters $\boldsymbol{\beta}$. The random effect vector $\mathbf{u}_i$
is assumed to have a multivariate normal distribution $N(\mathbf{0}, \boldsymbol{\Sigma})$. The covariance matrix
$\boldsymbol{\Sigma}$ depends on unknown variance components and possibly also correlation parameters.
Conditional on $\{\mathbf{u}_i\}$, standard model fitting treats $\{y_{it}\}$ as independent over $i$ and $t$. As
discussed in Section 11.2.2, the variability among $\{\mathbf{u}_i\}$ induces nonnegative associations
among the responses, for the marginal distribution averaged over the subjects. This is
caused by the shared random effect $\mathbf{u}_i$ for each observation in a cluster.
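
The way a shared random effect induces marginal association can be checked by simulation. The following is a minimal sketch rather than anything from the text: it assumes a random intercept logistic model with no explanatory variables, and the function names and the values of `alpha`, `sigma`, and the number of clusters are illustrative choices only.

```python
import numpy as np

def simulate_random_intercept(n_clusters=20000, T=2, alpha=0.0, sigma=2.0, seed=0):
    """Simulate binary y_it with logit[P(Y_it = 1 | u_i)] = alpha + u_i, u_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, sigma, size=n_clusters)      # one shared random effect per cluster
    p = 1.0 / (1.0 + np.exp(-(alpha + u)))           # conditional success probability
    # Given u_i, the T observations in a cluster are drawn independently.
    y = rng.random((n_clusters, T)) < p[:, None]
    return y.astype(int)

def within_cluster_corr(y):
    """Sample correlation between the first two observations of each cluster."""
    return float(np.corrcoef(y[:, 0], y[:, 1])[0, 1])

corr_heterogeneous = within_cluster_corr(simulate_random_intercept(sigma=2.0))
corr_homogeneous = within_cluster_corr(simulate_random_intercept(sigma=0.0))
print(f"sigma = 2: {corr_heterogeneous:.3f}; sigma = 0: {corr_homogeneous:.3f}")
```

With heterogeneity among clusters the marginal correlation is clearly positive, whereas with no heterogeneity it should be near zero, even though the observations are generated as conditionally independent in both cases.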

In (13.1), the random effect enters the model on the same scale as the predictor terms. This
is convenient but also natural for many applications. For instance, random effects sometimes
represent heterogeneity caused by omitting certain explanatory variables. Consider the
special case with univariate random effect and $z_{it} = 1$. With $u_i$ replaced by $u_i^* \sigma$ where $\{u_i^*\}$
are $N(0, 1)$, the GLMM has the form

$g(\mu_{it}) = \mathbf{x}_{it}^T \boldsymbol{\beta} + u_i^* \sigma.$

This has the form of an ordinary GLM with unobserved values $\{u_i^*\}$ of a particular covariate.
Thus, random effects models relate to methods of dealing with unmeasured predictors and
other forms of missing data. The random effects part of the linear predictor reflects terms
that would be in the fixed effects part if those explanatory variables had been included.
Random effects also sometimes represent random measurement error in the explanatory
variables. If we replace a particular $x_{it}$ by $x_{it}^* + \epsilon_{it}$, with $x_{it}^*$ the true value and $\epsilon_{it}$ the
measurement error, then $\epsilon_{it}$ times the regression parameter can be absorbed in the random
effects term. Related to these motivations, random effects also provide a mechanism for
explaining overdispersion in basic models not having those effects (Breslow and Clayton
1993, Molenberghs et al. 2010).
13.1.2 Logistic GLMM with Random Intercept for Binary Matched Pairs

We illustrate the GLMM expression (13.1) using a simple case, that of binary matched
pairs. Cluster $i$ consists of the responses $(y_{i1}, y_{i2})$ for matched pair $i$. Observation $t$ in
cluster $i$ has $y_{it} = 1$ (a success) or 0 (a failure), $t = 1, 2$.

In Section 11.2.2 we introduced the model

$\mathrm{logit}[P(Y_{it} = 1)] = \alpha_i + \beta x_t, \qquad (13.2)$

where $x_1 = 0$ and $x_2 = 1$. For it, $\beta$ is a cluster-specific log odds ratio. That section treated
$\alpha_i$ as a fixed effect and eliminated it using conditional ML. An equivalent representation of
(13.2) is

$\mathrm{logit}[P(Y_{i1} = 1 \mid u_i)] = \alpha + u_i, \quad \mathrm{logit}[P(Y_{i2} = 1 \mid u_i)] = \alpha + \beta + u_i, \qquad (13.3)$

where $u_i = \alpha_i - \alpha$ for some constant $\alpha$. Now, we treat $u_i$ as a random effect for cluster $i$,
with $\{u_i\}$ independent from a $N(0, \sigma^2)$ distribution with $\sigma$ unknown. Conditionally on $u_i$,
we treat $y_{i1}$ and $y_{i2}$ as independent.

Model (13.3) is the special case of (13.1) in which $\mu_{it} = P(Y_{it} = 1 \mid u_i)$, $g(\cdot)$ is the logit
link, $\boldsymbol{\beta}^T = (\alpha, \beta)$, $\mathbf{x}_{i1}^T = (1, 0)$ and $\mathbf{x}_{i2}^T = (1, 1)$ for all $i$, and $z_{it} = 1$ for all $i$ and $t$. The
univariate random effect adjusts the intercept but does not modify the fixed effect. A GLMM
with random effect of this form is called a random intercept model. Instead of the usual
fixed intercept $\alpha$, it has a random intercept $\alpha + u_i$.

Let $Y_1 = \sum_i y_{i1}$ and $Y_2 = \sum_i y_{i2}$. These are dependent binomial variates. Marginally,
$Y_1$ is binomial with $n$ trials and parameter $E\{\exp(\alpha + U)/[1 + \exp(\alpha + U)]\}$, and $Y_2$ is
binomial with parameter $E\{\exp(\alpha + \beta + U)/[1 + \exp(\alpha + \beta + U)]\}$. The expectations refer
to $U$, a $N(0, \sigma^2)$ random variable. The model implies a nonnegative correlation between
$Y_1$ and $Y_2$, with greater association resulting from greater heterogeneity (i.e., larger $\sigma$).
Clusters with a large positive $u_i$ have a relatively large $P(Y_{it} = 1 \mid u_i)$ for each $t$, whereas
clusters with a large negative $u_i$ have a relatively small $P(Y_{it} = 1 \mid u_i)$ for each. For this
model, $Y_1$ and $Y_2$ are independent only if $\sigma = 0$.

A 2x2 population-averaged table with (success, failure) for both the row and column
categories summarizes the number of observations for which $(y_{i1}, y_{i2}) = (1, 1), (1, 0), (0, 1)$,
or $(0, 0)$. Let $\{n_{ab}\}$ denote these counts. Table 13.1 is an example. Let $\{\hat{\mu}_{ab}\}$ denote marginal
fitted values for model (13.3). We defer discussion of model fitting until Section 13.6.
However, model (13.3) is a rare instance in which the fixed effect in a model containing random


Table 13.1 Presidential Votes in 2004 and in 2008, for
Males Sampled in 2010 by the General Social Survey

                       2008 Election
2004 Election    Democrat   Republican   Total
Democrat              175           16     191
Republican             54          188     242
Total                 229          204     433



effects has a closed-form ML estimate,

$\hat{\beta} = \log(\hat{\mu}_{21}/\hat{\mu}_{12}).$

When the sample log odds ratio $\log(n_{11} n_{22}/n_{12} n_{21}) > 0$, then $\{\hat{\mu}_{ab} = n_{ab}\}$ and
$\hat{\beta} = \log(n_{21}/n_{12})$. This is the same as the conditional ML estimate (Section 11.2.3). Neuhaus
et al. (1994) showed that this is true for any parametric choice of random effects distribution
for which the model (13.3) can generate $\{n_{ab}\}$ as fitted values. Lindsay et al. (1991)
showed that this estimate also results with a nonparametric approach discussed in Section
14.2.5. The model implies that the true log odds ratio for this 2x2 table is at least 0. When
$\log(n_{11} n_{22}/n_{12} n_{21}) < 0$, however, then $\hat{\sigma} = 0$ and the fitted values $\{\hat{\mu}_{ab} = n_{a+} n_{+b}/n\}$
satisfy independence. Then, $\hat{\beta}$ is identical to the estimate for the marginal model (11.6) by
which $\hat{\beta}$ is the difference between logits for the two marginal distributions, which is the
log odds ratio $\hat{\beta} = \log[(n_{+1}/n_{+2})/(n_{1+}/n_{2+})]$.

13.1.3  Example:  Changes  in  Presidential  Voting  Revisited 

Table 13.1 on voting in successive Presidential elections was first analyzed in Section 11.1.
For it, the ML fit of model (13.3) yields $\hat{\beta} = \log(54/16) = 1.216$ (SE = 0.285), with
$\hat{\sigma} = 5.22$ describing variability of the random effects. This is identical to the conditional ML
estimate (11.10), with standard error $[(1/54) + (1/16)]^{1/2} = 0.285$. For a given male voting
Democrat or Republican in these two elections, the estimated odds of voting Democrat in
2008 equal $\exp(1.216) = 3.375$ times the odds in 2004. The large $\hat{\sigma}$ reflects the very strong
association between the two responses, with sample odds ratio 38.1.
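
The arithmetic in this example can be reproduced directly from the counts in Table 13.1. The short sketch below uses only the closed-form expressions quoted above; the variable names are illustrative, and it does not fit the random effects model itself (so it does not reproduce the estimate of $\sigma$).

```python
import math

# Counts from Table 13.1: rows are the 2004 vote, columns the 2008 vote
# (index 1 = Democrat, 2 = Republican).
n11, n12, n21, n22 = 175, 16, 54, 188

beta_hat = math.log(n21 / n12)                  # closed-form estimate log(n21/n12)
se_beta = math.sqrt(1 / n21 + 1 / n12)          # standard error of the conditional ML estimate
odds_multiplier = math.exp(beta_hat)            # subject-specific odds ratio, 2008 vs. 2004
sample_odds_ratio = (n11 * n22) / (n12 * n21)   # association between the two responses

print(f"beta_hat = {beta_hat:.3f}, SE = {se_beta:.3f}")
print(f"exp(beta_hat) = {odds_multiplier:.3f}, sample odds ratio = {sample_odds_ratio:.1f}")
```

The closed form applies here because the sample log odds ratio is positive, which is the case in which the fitted values equal the observed counts.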


13.1.4  Extension:  Rasch  Model  and  Item  Response  Models 

An extension of the logistic matched-pairs model (13.3) allows $T > 2$ observations in each
cluster. The random intercept model then has form

$\mathrm{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + u_i, \qquad (13.4)$

where $\{u_i\}$ are independent $N(0, \sigma^2)$ and where identifiability requires a constraint such as
$\beta_1 = 0$. Equivalently, the model can delete $\alpha$ and the constraint on $\{\beta_t\}$.

Early applications of this GLMM were in psychometrics. The model describes responses
to a battery of $T$ questions on an exam. The probability $P(Y_{it} = 1 \mid u_i)$ that subject $i$ makes
the correct response on question $t$ depends on the overall ability of subject $i$, characterized
by $u_i$, and the easiness of question $t$, characterized by $\beta_t$. Such models are called
item-response models. The logistic form (13.4) is called the Rasch model (Rasch 1961). In
estimating $\{\beta_t\}$, Rasch treated $\{u_i\}$ as fixed effects and used conditional ML, as outlined in
Section 11.2.3 for matched pairs. Later authors used the normal random effects approach
for this model and sometimes the corresponding model with probit link (Bock and Aitkin
1981).

The $\{\beta_t\}$ in the Rasch model differ from parameters in corresponding marginal models
such as (11.33), which is

$\mathrm{logit}[P(Y_t = 1)] = \alpha + \beta_t, \quad t = 1, \ldots, T, \qquad (13.5)$

since the Rasch model effects are subject-specific. The Rasch model refers to a $T \times 2 \times n$
table of observation-by-outcome-by-subject, whereas the marginal model refers to the $T \times 2$
observation-by-outcome table of the $T$ marginal distributions, collapsed over subjects. For
observations $s$ and $t$ for a given subject $i$ with model (13.4),

$\beta_s - \beta_t = \mathrm{logit}[P(Y_{is} = 1 \mid u_i)] - \mathrm{logit}[P(Y_{it} = 1 \mid u_i)],$

which is a log odds ratio conditional on the subject. By contrast, the corresponding
population-averaged effect in marginal model (13.5) is

$\beta_s - \beta_t = \mathrm{logit}[P(Y_{hs} = 1)] - \mathrm{logit}[P(Y_{it} = 1)],$

with subject $h$ randomly selected for observation $s$ and subject $i$ randomly selected for
observation $t$ (i.e., $h$ and $i$ are independent observations).
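
To see numerically how a subject-specific effect differs from the population-averaged effect it implies, one can average the conditional probabilities in (13.4) over the normal random effect. The sketch below does this with Gauss-Hermite quadrature; the helper names and the parameter values are assumptions chosen for illustration, not values from the text.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_prob(eta, sigma, n_nodes=80):
    """Approximate E[expit(eta + U)] for U ~ N(0, sigma^2) by Gauss-Hermite quadrature."""
    nodes, weights = hermgauss(n_nodes)            # rule for integrals against exp(-x^2)
    u = np.sqrt(2.0) * sigma * nodes               # change of variable to the N(0, sigma^2) scale
    cond_prob = 1.0 / (1.0 + np.exp(-(eta + u)))   # conditional probability at each node
    return float(np.sum(weights * cond_prob) / np.sqrt(np.pi))

def logit(p):
    return float(np.log(p / (1.0 - p)))

# Illustrative (assumed) parameter values for model (13.4).
alpha, beta_s, beta_t, sigma = 0.0, 1.0, 0.0, 2.0

subject_specific = beta_s - beta_t
population_averaged = (logit(marginal_prob(alpha + beta_s, sigma))
                       - logit(marginal_prob(alpha + beta_t, sigma)))
print(f"subject-specific: {subject_specific:.3f}, "
      f"population-averaged: {population_averaged:.3f}")
```

Under these assumed values the population-averaged difference of logits has the same sign as the subject-specific one but is smaller in magnitude, and the two coincide when `sigma` is 0.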


13.1.5  Random  Effects  Versus  Conditional  ML  Approaches 

Suppose we treat $\{u_i\}$ in model (13.4) as fixed effects instead of random effects and use
ordinary ML to estimate $\{\beta_t\}$ and $\{u_i\}$. As $n$ increases, so does the number of parameters,
since each cluster has a $u_i$. Even though the number of $\{\beta_t\}$ does not increase as $n$ does,
the ordinary ML estimators $\{\hat{\beta}_t\}$ are not consistent. This happens in many models when
the number of parameters has size on the same order as the number of observations.
Asymptotic optimality properties of ML estimators, such as consistency, require the number
of parameters to be fixed as $n$ increases. For model (13.4), ML estimators of $\{\beta_t\}$ have bias
of order $T/(T - 1)$ (Andersen 1980, pp. 244-245). For the matched-pairs model (13.2),
for instance, $\hat{\beta} \to 2\beta$ in probability (Exercise 11.29).

For this reason, the preferable approach for the fixed effects model is conditional ML,
eliminating $\{u_i\}$ by conditioning on their sufficient statistics $\{S_i = \sum_t y_{it},\ i = 1, \ldots, n\}$.
In the item-response context, these are the numbers of correct responses for each subject.
Conditional on $\{S_i\}$, the distribution of $\{y_{it}\}$ is independent of $\{u_i\}$. Maximizing the resulting
likelihood then yields consistent estimators of $\{\beta_t\}$. The analysis generalizes the one in
Section 11.2.3 for the subject-specific logistic model (11.8) for matched pairs.
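
The contrast between the two estimators can be illustrated by simulation for matched pairs. The sketch below is only illustrative: the pair-specific intercepts are drawn from a normal distribution merely to create heterogeneous pairs, the function name and parameter values are assumptions, and the unconditional ML estimate is computed from the known closed form $2\log(n_{21}/n_{12})$ for this model rather than by maximizing the likelihood over all intercepts.

```python
import numpy as np

def matched_pairs_estimates(n_pairs=200000, beta=1.0, sigma=1.0, seed=1):
    """Simulate model (13.2) and return (conditional ML, unconditional ML) estimates of beta."""
    rng = np.random.default_rng(seed)
    alpha_i = rng.normal(0.0, sigma, size=n_pairs)     # pair-specific intercepts
    p1 = 1.0 / (1.0 + np.exp(-alpha_i))                # P(Y_i1 = 1), with x_1 = 0
    p2 = 1.0 / (1.0 + np.exp(-(alpha_i + beta)))       # P(Y_i2 = 1), with x_2 = 1
    y1 = rng.random(n_pairs) < p1
    y2 = rng.random(n_pairs) < p2
    n12 = int(np.sum(y1 & ~y2))                        # success first, failure second
    n21 = int(np.sum(~y1 & y2))                        # failure first, success second
    conditional = float(np.log(n21 / n12))             # uses only the discordant pairs
    unconditional = 2.0 * conditional                  # closed form for fixed-effects ML
    return conditional, unconditional

conditional, unconditional = matched_pairs_estimates()
print(f"true beta = 1.0, conditional ML = {conditional:.3f}, "
      f"unconditional ML = {unconditional:.3f}")
```

With many pairs, the conditional estimate settles near the true value, whereas the unconditional estimate settles near twice that value.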

Compared with the random effects approach, the conditional ML approach has certain
advantages. It is not necessary to assume a parametric distribution for $\{u_i\}$. It is
difficult to check this assumption in the random effects approach. Conditional ML is
also appropriate with retrospective sampling. In that case, bias can occur with a random
effects approach because the clusters are not randomly sampled (Neuhaus and Jewell
1990b).

However, the conditional ML approach has severe disadvantages. It is restricted to the
canonical link (the logit), for which reduced sufficient statistics exist for $\{u_i\}$. Also, as
discussed in Section 11.2.7, it is restricted to inference about within-cluster fixed effects.
The conditioning removes the source of variability needed for estimating between-cluster
effects in models with explanatory variables such as those considered next. Also, this
approach does not provide information about $\{u_i\}$, such as predictions of their values
and estimates of their variability or of the probabilities they determine. Finally, in more
general models with covariates, conditional ML can be less efficient than the random effects
approach for estimating the fixed effects (see Note 13.2).
13.2 BINARY RESPONSES: LOGISTIC-NORMAL MODEL

The item-response model (13.4) with random intercept is a special case of an important class
of random effects models for binary data called logistic-normal models. With univariate
random effect, the model form is

$\mathrm{logit}[P(Y_{it} = 1 \mid u_i)] = \mathbf{x}_{it}^T \boldsymbol{\beta} + u_i, \qquad (13.6)$

where $\{u_i\}$ are independent $N(0, \sigma^2)$ variates. The logistic-normal model has a long history,
dating at least to Cox (1970, Prob. 20 in that text) for the matched-pairs model (13.3) and
Pierce and Sands (1975).

13.2.1  Shared  Random  Effect  Implies  Nonnegative  Marginal  Correlations 

More generally, the link function in the random intercept model (13.6) can be an arbitrary
inverse cdf. For such models, $Y_{is}$ and $Y_{it}$ are treated conditionally (given $u_i$) as independent.
The model implies that they are marginally nonnegatively correlated. Let $\Phi$ denote the cdf
that is the inverse link function. Then, for $s \ne t$,

$\mathrm{cov}(Y_{is}, Y_{it}) = E[\mathrm{cov}(Y_{is}, Y_{it} \mid u_i)] + \mathrm{cov}[E(Y_{is} \mid u_i), E(Y_{it} \mid u_i)]$

$\qquad = 0 + \mathrm{cov}[\Phi(\mathbf{x}_{is}^T \boldsymbol{\beta} + u_i), \Phi(\mathbf{x}_{it}^T \boldsymbol{\beta} + u_i)]. \qquad (13.7)$

The functions in the last covariance term are both monotone increasing in $u_i$, and hence are
nonnegatively correlated.

When the predictor value $x_{it}$ is the same for each $t$, the marginal distribution implied
by  the  model  is  exchangeable  among  components  of  a  cluster.  This  is  often  plausible.  In 
longitudinal  studies,  however,  observations  closer  together  in  time  may  tend  to  be  more 
highly  correlated. 

Usually,  the  main  focus  in  using  a  GLMM  is  inference  about  the  fixed  effects.  The  random 
effects  part  of  the  model  is  a  mechanism  for  representing  how  the  positive  correlation  occurs 
between  observations  within  a  cluster.  Parameters  pertaining  to  the  random  effects  may 
themselves be of interest, however. For instance, the estimate $\hat{\sigma}$ of the standard deviation
of  a  random  intercept  summarizes  the  degree  of  heterogeneity  of  a  population. 

13.2.2  Interpreting  Heterogeneity  in  Logistic-Normal  Models 

When $\sigma = 0$, the logistic-normal model (13.6) simplifies to the ordinary logistic regression
model treating all observations as independent. When $\sigma > 0$, how can we interpret the
variability in effects that this model implies?

Consider observation $y_{it}$ at setting $x_{it}$ of predictors and observation $y_{hs}$ at setting $x_{hs}$.
Their log odds ratio is

$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] - \operatorname{logit}[P(Y_{hs} = 1 \mid u_h)] = (x_{it} - x_{hs})^T \beta + (u_i - u_h).$$

We cannot observe $(u_i - u_h)$, which has a $N(0, 2\sigma^2)$ distribution. However, $100(1 - \alpha)\%$
of those log odds ratios fall within

$$(x_{it} - x_{hs})^T \beta \pm z_{\alpha/2} \sqrt{2}\, \sigma. \tag{13.8}$$

When $\sigma = 0$, $(x_{it} - x_{hs})^T \beta$ is the usual form of log odds ratio for a model without
random effects. When $\sigma > 0$, $(x_{it} - x_{hs})^T \beta$ is the log odds ratio for two observations in
the same cluster ($h = i$) or with the same random effect value. Suppose that $x_{it} = x_{hs}$ for
observations from different clusters. Then, since $z_{0.25} = 0.674$, the middle 50% of the log
odds ratios fall within $\pm 0.674 \sqrt{2}\, \sigma = \pm 0.95\sigma$. Hence, the median odds ratio between the
observation with higher random effect and the observation with lower random effect equals
$\exp(0.95\sigma)$. With a single predictor and $x_{it} - x_{hs} = 1$, the median such odds ratio equals
$\exp(\beta \pm 0.95\sigma)$. Larsen et al. (2000) presented related interpretations.
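
The constant 0.95 in this interpretation can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name and the value `sigma = 1.0` are illustrative assumptions, not part of the text.

```python
from math import exp, sqrt
from statistics import NormalDist


def log_odds_ratio_half_width(sigma, coverage=0.50):
    """Half-width of the central `coverage` interval for u_i - u_h ~ N(0, 2 * sigma^2)."""
    # z_{alpha/2} with alpha = 1 - coverage; about 0.674 when coverage is 0.50
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * sqrt(2.0) * sigma


sigma = 1.0  # hypothetical random-intercept standard deviation
half_width = log_odds_ratio_half_width(sigma)
median_odds_ratio = exp(half_width)  # higher versus lower random effect, equal predictors
print(round(half_width, 3), round(median_odds_ratio, 2))
```

Because the half-width is proportional to $\sigma$, doubling `sigma` doubles the half-width on the log odds ratio scale.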

13.2.3  Connections  Between  Random  Effects  Models  and  Marginal  Models 

The fixed effects parameters $\beta$ in GLMMs have conditional interpretations, given the
random  effect.  Those  fixed  effects  are  of  two  types.  First,  consider  an  explanatory  variable 
that  varies  in  value  among  observations  in  a  cluster.  For  instance,  in  a  crossover  study 
comparing  T  drugs,  for  each  subject  the  drug  taken  varies  from  observation  to  observation 
in  that  subject’s  cluster  of  T  observations.  For  such  an  explanatory  variable,  its  coefficient 
in  the  model  refers  to  the  effect  on  the  response  of  a  within-cluster  (e.g.,  subject-specific) 
1-unit  increase  of  that  predictor.  The  random  effect  as  well  as  other  explanatory  variables 
in the model are constant while that predictor increases by 1. The effect of that explanatory
variable  is  a  “within-cluster”  (e.g.,  within-subject)  one. 

Second,  consider  an  explanatory  variable  with  constant  value  among  observations  in  a 
cluster.  An  example  is  gender  when  each  cluster  is  an  individual.  For  such  an  explanatory 
variable, its coefficient refers to the effect on the response of a “between-cluster” 1-unit
increase  of  that  predictor.  An  example  is  a  comparison  of  females  and  males  using  an 
indicator  variable  and  its  coefficient.  However,  this  fixed  effect  in  the  GLMM  applies  only 
when  the  random  effect  (as  well  as  other  explanatory  variables  in  the  model)  takes  the  same 
value  in  both  groups:  for  instance,  a  male  and  a  female  with  the  same  random  effect  values. 

It  is  in  this  sense  that  random  effects  models  are  cluster-specific  models,  as  both  within- 
and  between-cluster  effects  apply  conditional  on  the  random  effect  value.  By  contrast, 
effects  in  marginal  models  are  averaged  over  all  clusters  (i.e.,  population  averaged),  so 
those  effects  do  not  refer  to  a  comparison  at  a  fixed  value  of  a  random  effect.  In  fact, 
a  fundamental  difference  between  the  two  model  types  is  that  when  the  link  function  is 
nonlinear,  such  as  the  logit,  the  population-averaged  effects  of  marginal  models  often  are 
smaller  in  absolute  value  than  the  cluster-specific  effects  of  GLMMs. 

Specifically, the GLMM (13.1) refers to the conditional mean, $\mu_{it} = E(Y_{it} \mid u_i)$. By
inverting the link function,

$$E(Y_{it} \mid u_i) = g^{-1}(x_{it}^T \beta + z_{it}^T u_i).$$

Marginally, averaging over the random effects, the mean is

$$E(Y_{it}) = E[E(Y_{it} \mid u_i)] = \int g^{-1}(x_{it}^T \beta + z_{it}^T u_i) f(u_i; \Sigma) \, du_i,$$

where $f(u; \Sigma)$ is the $N(0, \Sigma)$ density function for the random effects. For the identity
link,

$$E(Y_{it}) = \int (x_{it}^T \beta + z_{it}^T u_i) f(u_i; \Sigma) \, du_i = x_{it}^T \beta.$$

The marginal model has the same model form and effects $\beta$. This is not true for other links.
For instance, for the logistic-normal model (13.6),

$$E(Y_{it}) = E\left[\frac{\exp(x_{it}^T \beta + u_i)}{1 + \exp(x_{it}^T \beta + u_i)}\right].$$

This expectation does not have form $\exp(x_{it}^T \beta)/[1 + \exp(x_{it}^T \beta)]$ except when $u_i$ has a
degenerate distribution ($\sigma = 0$).

Approximate relationships exist between effects from the two model types. In the
logistic-normal case with effect $\beta$ and small $\sigma$, Zeger et al. (1988) showed that

$$E(Y_{it}) \approx \exp(c\, x_{it}^T \beta)/[1 + \exp(c\, x_{it}^T \beta)], \tag{13.9}$$

where $c = [1 + 0.346\sigma^2]^{-1/2}$. Since the effect in the marginal model multiplies that of the
random effects model by about $c$, it is smaller in absolute value. The discrepancy increases
as $\sigma$ increases. For $\beta$ near 0, Neuhaus et al. (1991) showed that the marginal model
effect is approximately $\beta(1 - \rho)$, where $\rho = \operatorname{corr}(Y_{it}, Y_{is})$ at $\beta = 0$. Again, the discrepancy
increases as $\sigma$ increases, since $\rho$ increases with $\sigma$.

For Table 13.1 on voting in 2004 and in 2008, the ML estimate for model (13.3) is
$\hat{\beta} = 1.216$, with $\hat{\sigma} = 5.22$ for variability of $\{u_i\}$. Approximation (13.9) suggests that this
corresponds to a marginal estimate of about $[1 + 0.346(5.22)^2]^{-1/2}(1.216) = 0.377$. The
actual marginal estimate is the log odds ratio for the sample marginal distributions, equaling

$$\log[(229/204)/(191/242)] = 0.35.$$

In fact, the marginal effect (0.35) is much smaller than the subject-specific effect (1.216).
At $\beta = 0$, the fit of the model is that of the symmetry model, for which $\hat{\mu}_{12} = \hat{\mu}_{21} =
(n_{12} + n_{21})/2$. The correlation for that 2 x 2 table equals 0.676, from which the subject-specific
estimate of 1.216 and the Neuhaus et al. (1991) approximation suggest a marginal
estimate of about $1.216(1 - 0.676) = 0.39$, also not far from the actual value of 0.35.
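
The arithmetic behind these three marginal values can be reproduced directly. The sketch below only plugs in the estimates and counts quoted above; the variable names are illustrative.

```python
from math import log, sqrt

beta_hat = 1.216   # subject-specific ML estimate quoted in the text
sigma_hat = 5.22   # estimated standard deviation of the random effects
rho_hat = 0.676    # correlation for the 2 x 2 table at beta = 0, as quoted

# Zeger et al. (1988): marginal effect is about c * beta, c = [1 + 0.346 sigma^2]^(-1/2)
c = 1.0 / sqrt(1.0 + 0.346 * sigma_hat**2)
zeger_marginal = c * beta_hat

# Neuhaus et al. (1991): marginal effect is about beta * (1 - rho) for beta near 0
neuhaus_marginal = beta_hat * (1.0 - rho_hat)

# Log odds ratio of the sample marginal distributions
sample_marginal = log((229 / 204) / (191 / 242))

print(round(zeger_marginal, 3), round(neuhaus_marginal, 2), round(sample_marginal, 2))
```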

Figure 13.1 illustrates why the marginal effect is smaller than the subject-specific
effect. For a single explanatory variable $x$, the figure shows subject-specific curves for
$P(Y_{it} = 1 \mid u_i)$ for several subjects when considerable heterogeneity exists. This corresponds
to a relatively large $\sigma$ for random effects. At any fixed value of $x$, variability occurs in the
conditional means, $E(Y_{it} \mid u_i) = P(Y_{it} = 1 \mid u_i)$. The average of these is the marginal mean,
$E(Y_{it})$. These averages for various $x$ values yield the superimposed dashed curve. It has a
shallower slope. In fact, it does not exactly follow the logistic formula.

Figure 13.1 Logistic random-intercept model, showing its subject-specific curves and the population-averaged
marginal curve averaging over these. (Vertical axis: $P(Y = 1)$.)
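
The averaging that produces the flatter marginal curve can be imitated numerically. This is a rough sketch, assuming a single predictor, illustrative values `beta = 1.0` and `sigma = 2.0`, and a plain trapezoidal rule in place of the integration methods of Section 13.6; none of these choices come from the text.

```python
import math


def expit(v):
    """Numerically stable inverse logit."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_prob(eta, sigma, n_grid=4001):
    """E[expit(eta + u)] for u ~ N(0, sigma^2), by the trapezoidal rule on [-8 sigma, 8 sigma]."""
    if sigma == 0:
        return expit(eta)
    lo = -8.0 * sigma
    step = 16.0 * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * expit(eta + u) * density
    return total * step


beta, sigma = 1.0, 2.0    # illustrative subject-specific slope and random-effect SD
x_lo, x_hi = -0.5, 0.5    # two predictor values around the center of the curve
marginal_slope = (logit(marginal_prob(beta * x_hi, sigma))
                  - logit(marginal_prob(beta * x_lo, sigma))) / (x_hi - x_lo)
c = (1.0 + 0.346 * sigma**2) ** -0.5   # attenuation factor from approximation (13.9)
print(round(marginal_slope, 3), round(c * beta, 3))
```

On the logit scale the averaged curve has a slope clearly below `beta`, and the factor `c` from (13.9) gives a value in the same neighborhood rather than an exact match.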

Similar  remarks  apply  to  other  GLMMs.  For  the  probit  link  with  binary  data,  however, 
the  probit  model  with  normal  random  effect  does  imply  a  marginal  model  of  probit  form 
(Exercise 13.25). With univariate random intercept, the marginal effect equals the
subject-specific effect multiplied by $[1 + \sigma^2]^{-1/2}$ (Zeger et al. 1988, Caffo and Griswold 2006).
The  marginal  model  is  also  of  probit  form  when  the  random  effects  have  a  mixture  of 
normal  distributions  (Caffo  et  al.  2007). 
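
For the probit link the scaling by $[1 + \sigma^2]^{-1/2}$ is exact, which a numerical check makes concrete. The sketch below assumes a random intercept with illustrative values `eta = 0.8` and `sigma = 1.5`, and compares a trapezoidal average of the conditional probit probabilities with the closed form.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def marginal_probit_prob(eta, sigma, n_grid=4001):
    """E[Phi(eta + u)] for u ~ N(0, sigma^2), by the trapezoidal rule on [-8 sigma, 8 sigma]."""
    lo = -8.0 * sigma
    step = 16.0 * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = exp(-0.5 * (u / sigma) ** 2) / (sigma * sqrt(2.0 * pi))
        total += weight * STD_NORMAL.cdf(eta + u) * density
    return total * step


eta, sigma = 0.8, 1.5   # illustrative linear predictor and random-effect SD
numeric = marginal_probit_prob(eta, sigma)
closed_form = STD_NORMAL.cdf(eta / sqrt(1.0 + sigma**2))
print(round(numeric, 6), round(closed_form, 6))
```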

13.2.4  Comments  About  GLMMs  Versus  Marginal  Models 

GLMMs  with  random  effects  describe  cluster-specific  effects,  whereas  marginal  models 
describe  population-averaged  effects.  Some  statisticians  prefer  one  of  these  types,  but  most 
feel  that  both  are  useful,  depending  on  the  application. 

The  random  effects  modeling  approach  is  preferable  if  we  want  to  specify  a  mechanism 
that  could  generate  positive  association  among  clustered  observations,  estimate  cluster- 
specific  effects,  estimate  their  variability,  or  model  the  joint  distribution.  Latent  variable 
constructions  used  to  motivate  model  forms  (e.g.,  for  binary  data,  the  tolerance  model  and 
the  related  threshold  and  utility  models  of  Section  7.1.1)  apply  more  naturally  at  the  cluster 
level  than  at  the  marginal  level.  Given  a  random  effects  model,  we  can  recover  information 
about  marginal  distributions.  That  is,  a  random  effects  model  implies  a  marginal  model, 
but a marginal model does not itself imply¹ a random effects model.

In  many  surveys  or  epidemiological  studies,  a  goal  is  to  compare  the  relative  frequency 
of  occurrence  of  some  outcome  for  different  groups  in  a  population,  such  as  smokers  and 
nonsmokers.  Then,  quantities  of  primary  interest  include  between-group  odds  ratios  among 
marginal probabilities for the different groups. That is, effects of interest are between-cluster
rather  than  within-cluster.  When  marginal  effects  are  the  main  focus,  it  is  simpler  and  often 
preferable  to  model  the  margins  directly.  We  can  then  parameterize  the  model  so  that 
regression  parameters  have  a  direct  marginal  interpretation.  Developing  a  more  detailed 
model  of  the  joint  distribution  that  generates  those  margins,  as  a  random  effects  model 
does,  provides  greater  opportunity  for  misspecification.  For  instance,  the  assumptions  that 
observations  in  a  cluster  are  independent,  given  the  random  effect,  and  that  the  random 
effects  have  constant  variance  throughout  the  space  of  explanatory  variable  values,  need 
not  be  valid. 

In  Section  13.2.3  we  noted  that  cluster-specific  effects  are  usually  larger  in  magnitude 
than  marginal  effects,  and  the  discrepancy  increases  as  variance  components  increase. 
Usually,  though,  the  significance  of  an  effect  is  similar  in  the  two  model  types.  If  one  effect 
seems  more  important  than  another  in  a  random  effects  model,  the  same  is  usually  true 
with  a  marginal  model.  So  the  choice  of  the  model  is  usually  not  crucial  to  inferential 
conclusions. 

This statement requires a caveat, however, since sizes of effects in marginal models
depend on the degree of heterogeneity in random effects models. In comparing effects
for two groups or two variables that have quite different variance components, relative
sizes of effects will differ for marginal and random effects models. Section 13.3.8 shows
an example. From approximation (13.9), the attenuation from the random effects to the
marginal effect will tend to be greater for the group having the larger variance component.
For instance, suppose that two groups, one young in age and the other elderly, both show the
same subject-specific effect in a crossover study comparing two drugs. If the elderly group
has more heterogeneity in their response propensities, their marginal effect may be smaller
than that for the younger group. The marginal effects differ even though the subject-specific
effects are the same, because of the greater variance component for the elderly. In such
cases, the subject-specific effect (appropriately modeled) may have more relevance.

¹ Although see Note 13.12 for an implicit connection.


13.3  EXAMPLES  OF  RANDOM  EFFECTS  MODELS  FOR  BINARY  DATA 

In  the  next  three  sections  we  present  a  variety  of  examples  of  random  effects  models.  In 
this  section  we  present  models  for  binary  responses. 

13.3.1 Example: Small-Area Estimation of Binomial Proportions

Small-area  estimation  refers  to  estimation  of  parameters  for  a  large  number  of  geographical 
areas  when  each  has  relatively  few  observations.  Examples  are  county-specific  estimates 
of  characteristics  such  as  the  proportions  of  people  unemployed,  living  below  the  poverty 
level,  and  not  having  health  insurance  coverage.  With  a  national  or  statewide  survey,  some 
counties  may  have  few  observations.  Then,  sample  proportions  in  the  counties  may  poorly 
estimate  the  true  countywide  proportions.  Random  effects  models  that  treat  each  county 
as  a  cluster  can  provide  improved  estimates.  In  assuming  that  the  true  proportions  vary 
according  to  some  distribution,  the  fitting  process  “borrows  from  the  whole” — it  uses  data 
from  all  the  counties  to  estimate  the  proportion  in  any  given  one. 

Let $\pi_i$ denote the true proportion in area $i$, $i = 1, \ldots, n$. These areas may be all the
ones of interest, or only a sample. Let $\{y_i\}$ denote independent bin($T_i$, $\pi_i$) variates; that
is, $y_i = \sum_{t=1}^{T_i} y_{it}$, where $\{y_{it}, t = 1, \ldots, T_i\}$ are independent with $P(Y_{it} = 1) = \pi_i$ and
$P(Y_{it} = 0) = 1 - \pi_i$. The sample proportions $\{p_i = y_i/T_i\}$ are ML estimates of $\{\pi_i\}$ for
the fixed effects model

$$\operatorname{logit}(\pi_i) = \alpha + \beta_i, \quad i = 1, \ldots, n.$$

This model is saturated, having $n$ nonredundant parameters (with a constraint such as
$\beta_1 = 0$) for the $n$ binomial observations.

For small $\{T_i\}$, $\{p_i\}$ have large standard errors. Thus, $\{p_i\}$ may display much more
variability than $\{\pi_i\}$, especially when $\{\pi_i\}$ are similar. Then, it is helpful to shrink $\{p_i\}$
toward their overall mean. We can accomplish this with the random effects model

$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + u_i, \tag{13.10}$$

where $\{u_i\}$ are independent $N(0, \sigma^2)$ variates. This model is a logistic analog of one-way
random effects ANOVA. When $\sigma = 0$, all $\pi_i$ are identical. For this model,

$$\hat{\pi}_i = \exp(\hat{\alpha} + \hat{u}_i)/[1 + \exp(\hat{\alpha} + \hat{u}_i)].$$

The predicted random effect $\hat{u}_i$ is the estimated mean of the distribution of $u_i$, given the
data (Section 13.6.6). This prediction depends on all the data, not just data from area $i$. A
benefit is potential reduction in the mean squared error of the estimates around $\{\pi_i\}$, using
$\{\hat{\pi}_i\}$ instead of $\{p_i\}$.

The random effects model estimate $\hat{\pi}_i$ can differ substantially from the sample proportion
$p_i$. For example, if $\hat{\sigma} = 0$, then all $\hat{u}_i = 0$. The random effects model estimate is then
$\hat{\pi}_i = (\sum_{i=1}^{n} \sum_{t} y_{it})/(\sum_{i} T_i)$, which is the overall sample proportion after pooling all $n$
samples. When truly all $\pi_i$ are equal, this is a much better estimator of that common value
than the sample proportion from a single sample. Generally, the random effects model
estimators shrink the separate sample proportions toward the overall sample proportion.
The amount of shrinkage decreases as $\hat{\sigma}$ increases. The shrinkage also decreases as the
$\{T_i\}$ grow. As each sample has more data, we put more trust in the separate sample
proportions.

We illustrate model (13.10) with a simulated sample of 2000 people to mimic a poll taken
before the 2008 U.S. presidential election. For $T_i$ observations in state $i$ ($i = 1, \ldots, 51$,
where $i = 51$ is District of Columbia = DC), $y_i$ is bin($T_i$, $\pi_i$), where $\pi_i$ is the actual
proportion of votes in state $i$ for Barack Obama in the 2008 election. Here, we took $T_i$
proportional to the number of people in state $i$ who voted in that election, subject to
$\sum_i T_i = 2000$. Table 13.2 shows $\{T_i\}$, $\{\pi_i\}$, and $\{p_i = y_i/T_i\}$.


Table 13.2 Estimates$^a$ of Proportion of Vote for Obama in 2008 U.S. Presidential Election,
Based on Sample Size $T_i$ in State $i$

State  $T_i$  $\pi_i$   $p_i$  $\hat{\pi}_i$    State  $T_i$  $\pi_i$   $p_i$  $\hat{\pi}_i$
AK         5    0.379   0.600          0.524    MT         7    0.471   0.429          0.489
AL        29    0.387   0.310          0.398    NC        66    0.497   0.348          0.390
AR        17    0.389   0.118          0.344    ND         5    0.445   0.400          0.488
AZ        35    0.449   0.371          0.425    NE        12    0.416   0.833          0.618
CA       207    0.609   0.623          0.612    NH        11    0.541   0.273          0.431
CO        37    0.537   0.432          0.461    NJ        59    0.571   0.542          0.533
CT        25    0.606   0.560          0.535    NM        13    0.569   0.538          0.519
DC         4    0.925   1.000          0.580    NV        13    0.552   0.467          0.491
DE         6    0.619   0.667          0.541    NY       116    0.629   0.664          0.638
FL       128    0.509   0.570          0.561    OH        87    0.514   0.552          0.543
GA        60    0.469   0.450          0.466    OK        22    0.344   0.182          0.350
HI         7    0.718   0.857          0.589    OR        28    0.568   0.500          0.503
IA        23    0.539   0.391          0.449    PA        92    0.545   0.576          0.562
ID        10    0.359   0.100          0.385    RI         7    0.629   0.857          0.589
IL        84    0.618   0.536          0.530    SC        29    0.449   0.310          0.398
IN        42    0.498   0.476          0.487    SD         6    0.448   0.667          0.541
KS        19    0.415   0.421          0.468    TN        40    0.418   0.425          0.455
KY        28    0.412   0.357          0.425    TX       123    0.436   0.496          0.498
LA        30    0.399   0.367          0.428    UT        15    0.342   0.667          0.570
MA        47    0.618   0.426          0.452    VA        57    0.526   0.491          0.496
MD        40    0.619   0.725          0.645    VT         5    0.675   0.800          0.560
ME        11    0.577   0.818          0.607    WA        46    0.573   0.674          0.618
MI        76    0.573   0.553          0.542    WI        45    0.562   0.578          0.554
MN        44    0.541   0.500          0.503    WV        11    0.425   0.545          0.520
MO        45    0.492   0.556          0.539    WY         4    0.325   0.500          0.506
MS        20    0.430   0.550          0.527

$^a$ $\pi_i$, true; $p_i$, sample; $\hat{\pi}_i$, estimate using random effects model.

For the ML fit of model (13.10), $\hat{\alpha} = 0.030$ and $\hat{\sigma} = 0.419$. The predicted random effect
values yield the proportion estimates $\{\hat{\pi}_i\}$, also shown in Table 13.2. Since $\{T_i\}$ are mostly
small and since $\hat{\sigma}$ is relatively small, considerable shrinkage of these estimates occurs
from the sample proportions toward the overall proportion of Obama voters for these 2000
people sampled, which was 0.5245. The $\{\hat{\pi}_i\}$ vary between 0.344 (for Arkansas) and 0.645
(for Maryland), whereas the sample proportions vary between 0.100 (for Idaho) and 1.0
(for DC). Sample proportions based on fewer observations, such as DC, tended to shrink
more. The random effects model estimates tend to be closer than the sample proportions
to the true values. The root mean square error about the true proportions is 0.038 for the
model-based estimates and 0.069 for the sample proportions.
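
The shrinkage for a single state can be sketched without refitting the model. The code below assumes that the predicted random effect is the posterior mean of $u_i$ with the reported ML estimates plugged in, uses a simple equally spaced grid as a stand-in for the integration methods of Section 13.6, and infers the counts (4 of 4 for DC, 3 of 5 for Alaska) from the sample proportions in Table 13.2. The function name is hypothetical.

```python
import math


def expit(v):
    """Numerically stable inverse logit."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def shrunken_proportion(y, T, alpha, sigma, n_grid=4001):
    """Return expit(alpha + E[u | y]) for y ~ bin(T, expit(alpha + u)), u ~ N(0, sigma^2).

    The posterior mean of u is a ratio of two grid sums, so the binomial
    coefficient and the normal normalizing constant cancel.
    """
    lo = -8.0 * sigma
    step = 16.0 * sigma / (n_grid - 1)
    numerator = denominator = 0.0
    for k in range(n_grid):
        u = lo + k * step
        p = expit(alpha + u)
        weight = p**y * (1.0 - p) ** (T - y) * math.exp(-0.5 * (u / sigma) ** 2)
        numerator += u * weight
        denominator += weight
    return expit(alpha + numerator / denominator)


alpha_hat, sigma_hat = 0.030, 0.419   # ML estimates reported in the text
dc = shrunken_proportion(y=4, T=4, alpha=alpha_hat, sigma=sigma_hat)   # sample proportion 1.000
ak = shrunken_proportion(y=3, T=5, alpha=alpha_hat, sigma=sigma_hat)   # sample proportion 0.600
print(round(dc, 3), round(ak, 3))
```

Under these assumptions both values should land close to the tabled estimates (0.580 for DC and 0.524 for Alaska), far from the raw sample proportions.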

13.3.2  Modeling  Repeated  Binary  Responses:  Attitudes  About  Abortion 

In  Section  13.1.4  we  introduced  a  random  effects  version  of  the  Rasch  model  for  repeated 
binary  measurement.  This  model  extends  to  incorporate  covariates. 

We  illustrate  using  Table  13.3.  The  subjects  indicated  whether  they  supported  legalizing 
abortion in each of three situations. Table 13.3 also classifies the subjects by gender. Let $y_{it}$
denote the response for subject $i$ in situation $t$, with $y_{it} = 1$ representing support. Consider
the model

$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + \gamma x_i + u_i, \tag{13.11}$$

where $x_i = 1$ for females and 0 for males, $\{u_i\}$ are independent $N(0, \sigma^2)$, and $\{\beta_t\}$ satisfy
a constraint such as $\beta_1 = 0$. Here, the gender effect $\gamma$ is assumed to be identical for each
situation, and $\{\beta_t\}$ refer to the situations.

Since model (13.11) implies nonnegative association among responses in the various
situations,  we  should  use  questions  and  scales  for  which  this  happens.  With  scale  (yes,  no), 
it  would  not  be  sensible  for  one  question  to  ask  “Should  abortion  be  legal  when  a  woman  is 
not  married?”  and  another  to  ask  “Should  abortion  be  illegal  during  the  last  three  months 
of  pregnancy?” 

Table 13.4 summarizes ML fitting results. The contrasts of $\{\hat{\beta}_t\}$ indicate greater support
for legalized abortion in situation 1 (when the family has low income) than in the other
two. There is slight evidence of greater support in situation 2 than in situation 3. The fixed
effects estimates have log odds ratio interpretations, within-subject for situation effects and
between-subject for the gender effect. For a given subject of either gender, for instance,
the estimated odds of supporting legalized abortion in situation 1 equal exp(0.835) = 2.30
times the estimated odds in situation 3. Since $\hat{\gamma} = 0.013$, for each situation the estimated
probability of supporting legalized abortion is similar for females and males having the
same random effect values.

Table 13.3 Support for Legalizing Abortion in Three Situations, by Gender

                  Sequence of Responses in Three Situations$^a$
Gender   (1,1,1)  (1,1,0)  (0,1,1)  (0,1,0)  (1,0,1)  (1,0,0)  (0,0,1)  (0,0,0)
Male         342       26        6       21       11       32       19      356
Female       440       25       14       18       14       47       22      457

$^a$ Situations are (1) if the family has a very low income and cannot afford any more children, (2) when the woman
is not married and does not want to marry the man, and (3) when the woman wants it for any reason. 1, yes; 0, no.
Source: Data from General Social Survey.

Table 13.4 Summary of ML Estimates for Generalized Linear Mixed Model (13.11) for
Table 13.3, and ML and GEE Estimates for Corresponding Marginal Model

                                         GLMM ML         Marginal ML       Marginal GEE
Effect                Parameter          Estimate   SE   Estimate   SE     Estimate   SE
Abortion              $\beta_1 - \beta_3$   0.835   0.160   0.148   0.030     0.149   0.030
                      $\beta_1 - \beta_2$   0.542   0.157   0.098   0.027     0.097   0.028
                      $\beta_2 - \beta_3$   0.292   0.157   0.049   0.027     0.052   0.027
Gender                $\gamma$              0.013   0.490   0.005   0.088     0.003   0.088
$\sqrt{\operatorname{var}(u_i)}$  $\sigma$  8.74    0.54

For these data, subjects are highly heterogeneous ($\hat{\sigma} = 8.74$). Thus, strong associations
exist among responses for the three situations. This is reflected by 1595 of the 1850 subjects
making the same response on all three: that is, response patterns (0, 0, 0) and (1, 1, 1). This
implies tremendous variability in between-subject odds ratios. From (13.8), for different
subjects of a given gender, the middle 50% of odds ratios comparing situations 1 and 3 are
estimated to vary between about exp(0.835 - 0.95 x 8.74) and exp(0.835 + 0.95 x 8.74).
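
Evaluating these two endpoints shows how wide the interval is. The short sketch below uses only the estimates quoted above, with the rounded constant 0.95 from (13.8).

```python
from math import exp

log_or_fixed = 0.835              # estimated beta_1 - beta_3 from Table 13.4
sigma_hat = 8.74                  # estimated random-effect standard deviation
half_width = 0.95 * sigma_hat     # 0.674 * sqrt(2) * sigma_hat, rounded as in the text

lower = exp(log_or_fixed - half_width)
upper = exp(log_or_fixed + half_width)
print(f"middle 50% of odds ratios: {lower:.5f} to {upper:.0f}")
```

The lower endpoint is well below 0.001 and the upper endpoint is in the thousands, so the interval spans roughly seven orders of magnitude.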

For such contingency table data, finding cell fitted values requires integrating over the
estimated random effects distribution to obtain estimated marginal probabilities of any
particular sequence of responses. For the ML parameter estimates, the probability of a
particular sequence of responses $(y_{i1}, \ldots, y_{iT})$ for a given $u_i$ is the appropriate product
of conditional probabilities, $\prod_t P(Y_{it} = y_{it} \mid u_i)$, since the responses are assumed to be
independent given $u_i$. Integrating this product probability with respect to $u_i$ for the $N(0, \hat{\sigma}^2)$
distribution estimates the marginal probability for a given cell (averaged over subjects).
This requires numerical integration methods described in Section 13.6. Multiplying this
marginal probability of a given sequence by the sample size for that multinomial gives a
fitted value. For instance, of the females, 440 indicated support under all three circumstances
(457 under none of the three), and the fitted value was 436.5 (459.3).
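
The integration step can be sketched as follows. The intercept estimate is not reported in this section, so `alpha = 0.0` below is a hypothetical placeholder and the resulting fitted counts are illustrative only; they are not expected to reproduce 436.5 or 459.3. The situation effects use the constraint $\beta_1 = 0$ with the contrasts from Table 13.4, and the trapezoidal rule stands in for the methods of Section 13.6.

```python
import math
from itertools import product


def expit(v):
    """Numerically stable inverse logit."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def pattern_probability(pattern, etas, sigma, n_grid=8001):
    """Marginal probability of a 0/1 response pattern: the product of the
    conditional probabilities, integrated over u ~ N(0, sigma^2)."""
    lo = -8.0 * sigma
    step = 16.0 * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        conditional = 1.0
        for y, eta in zip(pattern, etas):
            p = expit(eta + u)
            conditional *= p if y == 1 else 1.0 - p
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * conditional * density
    return total * step


alpha = 0.0                    # HYPOTHETICAL intercept; the fitted value is not given here
beta = (0.0, -0.542, -0.835)   # beta_1 = 0; beta_2, beta_3 implied by the Table 13.4 contrasts
gamma, sigma = 0.013, 8.74     # gender effect and random-effect SD from Table 13.4
x_female = 1
etas = [alpha + b + gamma * x_female for b in beta]

probs = {pat: pattern_probability(pat, etas, sigma) for pat in product((0, 1), repeat=3)}
female_counts = (440, 25, 14, 18, 14, 47, 22, 457)   # female row of Table 13.3
n_female = sum(female_counts)
fitted = {pat: n_female * p for pat, p in probs.items()}
print(round(sum(probs.values()), 6), round(fitted[(1, 1, 1)], 1), round(fitted[(0, 0, 0)], 1))
```

With such a large $\hat{\sigma}$, most of the probability falls on the two constant patterns regardless of the placeholder intercept, which matches the concentration seen in Table 13.3.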

Overall chi-squared statistics comparing the 16 observed and fitted counts are $G^2 = 23.2$
and $X^2 = 27.8$ (df = 9). These are large, but the sample size is very large. Here, df = 9
since we are modeling 14 multinomial parameters (8 - 1 = 7 for each gender) using five
GLMM parameters ($\alpha, \beta_2, \beta_3, \gamma, \sigma$). An extended model allows interaction between gender
and situation. It has different $\{\beta_t\}$ for men and women. However, it does not fit better. The
likelihood-ratio statistic comparing the models equals 1.0 (df = 2).

An alternative analysis of these data focuses on the marginal distributions, treating the
dependence as a nuisance. A marginal model analog of (13.11) is

$$\operatorname{logit}[P(Y_t = 1)] = \alpha + \beta_t + \gamma x.$$

For it, Table 13.4 also shows the ML estimates and the GEE estimates for the exchangeable
working correlation structure. The marginal model fits well, with $G^2 = 1.10$; here, df = 2
since the model describes six marginal probabilities (three for each gender) using four
parameters. These population-averaged $\{\hat{\beta}_t\}$ are much smaller than the subject-specific
$\{\hat{\beta}_t\}$ from the GLMM. This reflects the very large GLMM heterogeneity ($\hat{\sigma} = 8.74$) and
the corresponding strong correlations among the three responses. For instance, the GEE
analysis estimates a common correlation of 0.82 between pairs of responses. Although the
GLMM $\{\hat{\beta}_t\}$ are about five to six times the marginal model $\{\hat{\beta}_t\}$, so are the standard errors.
The two approaches provide similar substantive interpretations and conclusions.

13.3.3  Example:  Longitudinal  Mental  Depression  Study  Revisited 

We now revisit Table 12.1 from a longitudinal study to compare a new drug with a standard
for treating subjects suffering mental depression. In Sections 12.1.1 and 12.2.2 we analyzed
the data using marginal models. A response $y_t$ on mental depression at time $t$ equals 1 for
normal and 0 for abnormal. For severity of initial diagnosis $s$ (1 = severe, 0 = mild), drug
treatment $d$ (1 = new, 0 = standard), and $t$, we used the model

$$\operatorname{logit}[P(Y_t = 1)] = \alpha + \beta_1 s + \beta_2 d + \beta_3 t + \beta_4 (d \times t)$$

to evaluate the marginal distributions.

Now let $y_{it}$ denote the response at time $t$ for subject $i$. The GLMM

$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_1 s + \beta_2 d + \beta_3 t + \beta_4 (d \times t) + u_i$$

has subject-specific rather than population-averaged effects. Table 13.5 shows the ML
estimates. The time trend estimates of $\hat{\beta}_3 = 0.48$ for the standard drug and $\hat{\beta}_3 + \hat{\beta}_4 = 1.50$
for the new drug are nearly identical to the ML and GEE estimates (also shown in the table)
for the corresponding marginal model. The reason is that the repeated observations do not
exhibit much correlation, as the GEE analysis observed. Here, this is reflected by $\hat{\sigma} = 0.07$,
showing little heterogeneity among subjects.

Based on the GLMM fit, integrating over the $N(0, 0.07^2)$ random effects distribution
yields marginal fitted values of the possible response sequences. Comparing these to the
sample counts in Table 12.1 indicates a relatively good fit. The model describes the 28
multinomial cell probabilities (seven for the trivariate response at each of the four severity x
drug combinations) using six parameters. The fit statistics comparing the observed cell
counts to their fitted values are $G^2 = 22.0$ and $X^2 = 20.8$ (df = 28 - 6 = 22).

The deviance increases by only 0.001 when we constrain $\sigma = 0$. From results to be
discussed in Section 13.6.5, the P-value for comparing models is half what one gets by
treating the deviance as chi-squared with df = 1. So, P = 0.49. This simpler model, which
gives nearly identical effect estimates and SE values, is adequate. This is also suggested by
AIC values (e.g., PROC NLMIXED in SAS reports 1173.9 for the GLMM and 1171.9 for
the simpler model with $\sigma = 0$).
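
The halved P-value follows from the chi-squared tail with df = 1, which can be written with the standard normal cdf. The sketch below uses only the deviance change and AIC values quoted above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

deviance_increase = 0.001   # change in deviance when sigma is constrained to 0

# For df = 1, P(chi-squared > d) = 2 * [1 - Phi(sqrt(d))]
chi2_tail = 2.0 * (1.0 - NormalDist().cdf(sqrt(deviance_increase)))
p_value = chi2_tail / 2.0   # halved, per the result cited from Section 13.6.5

aic_glmm, aic_simple = 1173.9, 1171.9   # AIC values quoted from PROC NLMIXED
aic_gap = aic_glmm - aic_simple

print(round(p_value, 2), round(aic_gap, 1))
```

The AIC gap of about 2 is what one extra parameter costs when the deviance is essentially unchanged.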


Table 13.5 Model Parameter Estimates for Marginal Model and Random Effects Logistic
GLMM Fitted to Table 12.1

                 Marginal ML       Marginal GEE        GLMM ML
Parameter      Estimate    SE    Estimate    SE    Estimate    SE
Diagnosis         -1.29   0.14      -1.31   0.15      -1.32   0.15
Drug              -0.06   0.22      -0.06   0.23      -0.06   0.22
Time               0.48   0.12       0.48   0.12       0.48   0.12
Drug x Time        1.01   0.18       1.02   0.19       1.02   0.19

13.3.4 Example: Capture-Recapture Prediction of Population Size

Capture-recapture  experiments  use  a  series  of  samples  to  estimate  the  size  of  a  population. 
Such  experiments  have  traditionally  been  used  to  estimate  animal  abundance  in  some 
habitat.  At  each  sampling  occasion,  animals  are  captured  and  marked  in  some  manner.  The 
animals  captured  for  any  given  sample  are  freed  and  all  animals  are  candidates  for  recapture 
in a later sample. With $T$ sampling occasions, a $2^T$ contingency table displays the data, with
scale (captured, not captured) at each occasion. The count $n_{22\cdots2}$ is missing for the cell
corresponding to noncapture at each occasion. If we knew this cell count, adding it to the
others would yield the population size. Models specified for this $2^T$ table use the $2^T - 1$
observed counts to fit the model. The fit refers to those $2^T - 1$ cells, but extrapolating it
yields  an  estimated  count  in  the  unobserved  cell.  Adding  that  to  the  total  of  the  observed 
counts  yields  an  estimate  of  population  size. 

To illustrate, with $T = 2$ captures, we observe $n_{11}$ animals at both occasions, $n_{12}$ at the
first but not the second, and $n_{21}$ at the second but not the first. We do not know the number
$n_{22}$ not captured either time. If we assumed independence in the $2 \times 2$ table, the prediction
$\hat{n}_{22}$ would be the value giving an odds ratio of 1.0; but $(n_{11}\hat{n}_{22})/(n_{12}n_{21}) = 1$ implies that
$\hat{n}_{22} = n_{12}n_{21}/n_{11}$. This yields a population size prediction of

$$\hat{N} = n_{11} + n_{12} + n_{21} + n_{12}n_{21}/n_{11} = n_{1+}n_{+1}/n_{11} \quad \text{with} \quad \mathrm{var}(\hat{N}) = \frac{n_{1+}n_{+1}n_{12}n_{21}}{n_{11}^{3}}$$

(Sekar and Deming 1949). The assumption of independence is usually unrealistic, however.
With additional sampling occasions, we can base our prediction on more complex
models.
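The two-sample prediction under independence is simple enough to compute directly. The sketch below is illustrative only: the function name and the counts (20, 30, 10) are hypothetical and are not data from the text.

```python
def two_sample_estimate(n11, n12, n21):
    """Population size prediction for T = 2 samples, assuming independence.

    n11: captured at both occasions
    n12: captured at the first occasion only
    n21: captured at the second occasion only
    """
    if n11 <= 0:
        raise ValueError("n11 must be positive for the estimate to exist")
    n22_hat = n12 * n21 / n11            # value giving an odds ratio of 1.0
    n1_plus = n11 + n12                  # total captured at occasion 1
    n_plus1 = n11 + n21                  # total captured at occasion 2
    N_hat = n1_plus * n_plus1 / n11      # equals n11 + n12 + n21 + n22_hat
    var_hat = n1_plus * n_plus1 * n12 * n21 / n11 ** 3
    return n22_hat, N_hat, var_hat

# Hypothetical counts, for illustration only
n22_hat, N_hat, var_hat = two_sample_estimate(n11=20, n12=30, n21=10)
print(n22_hat, N_hat, var_hat)
```

Note that the estimate is undefined when no animal is captured at both occasions, which is why the sketch rejects $n_{11} = 0$.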

Table  13.6,  from  Cormack  (1989),  refers  to  a  study  having  T  =  6  consecutive  trapping 
days  for  a  population  of  snowshoe  hares.  The  study  observed  68  hares.  For  instance,  the 
table  indicates  that  3  hares  were  observed  on  the  first  day  but  on  none  of  the  other  days. 
For  simplicity,  models  for  studies  over  a  brief  time  period  assume  that  no  deaths,  births,  or 
immigration  into  the  population  occurred  during  the  study  period.  This  is  called  a  closed 
population. 

Most methods for capture-recapture treat the probability of capture at a given occasion as
identical for each subject (e.g., animal). This is usually unrealistic. To allow heterogeneous
capture probabilities, we use a logistic random effects model. For subject $i$, $i = 1, \ldots, N$
with $N$ unknown, let $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$, where $y_{it} = 1$ denotes capture in sample $t$ and
$y_{it} = 0$ denotes noncapture. Lacking explanatory variables, we could use the Rasch-type
model

$$\mathrm{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + u_i, \qquad (13.12)$$

where $\{u_i\}$ are independent $N(0, \sigma^2)$, with a constraint such as $\beta_1 = 0$. The larger the value
of $\beta_t$, the greater the capture probability at occasion $t$. The larger $\sigma$ is, the more
heterogeneous are the capture probabilities. When $\sigma = 0$ this logistic-normal model simplifies to
mutual independence [i.e., loglinear model (9.6)] for the $2^T$ table.
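To see how heterogeneity induces dependence among occasions, the cell probabilities of the $2^T$ table can be computed by integrating the random effect out of a model of the form logit[P(Y_it = 1 | u_i)] = alpha + beta_t + u_i with u_i ~ N(0, sigma^2). The sketch below is an illustration under assumptions: the function name, the parameter values, and the simple midpoint-rule integration (used in place of Gauss-Hermite quadrature to keep it dependency-free) are all choices made here, not the book's method.

```python
import math
from itertools import product

def cell_probabilities(alpha, betas, sigma, n_grid=2001, width=8.0):
    """Marginal probability of each capture history (y_1, ..., y_T)."""
    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    if sigma == 0:
        nodes, weights = [0.0], [1.0]          # no heterogeneity: point mass at 0
    else:
        # Midpoint rule for a standard normal z on [-width, width]; u = sigma * z
        step = 2.0 * width / n_grid
        zs = [-width + (k + 0.5) * step for k in range(n_grid)]
        dens = [math.exp(-0.5 * z * z) for z in zs]
        total = sum(dens)
        nodes = [sigma * z for z in zs]
        weights = [d / total for d in dens]     # normalized to sum to 1

    probs = {}
    for history in product([0, 1], repeat=len(betas)):
        p = 0.0
        for u, w in zip(nodes, weights):
            term = w
            for y, beta in zip(history, betas):
                pi = expit(alpha + beta + u)
                term *= pi if y == 1 else 1.0 - pi
            p += term
        probs[history] = p
    return probs

# Hypothetical parameter values with T = 2 occasions and beta_1 = 0
indep = cell_probabilities(alpha=-0.5, betas=[0.0, 0.3], sigma=0.0)
hetero = cell_probabilities(alpha=-0.5, betas=[0.0, 0.3], sigma=1.5)
```

With `sigma=0.0` the joint probabilities factor into products of the occasion-specific probabilities; with `sigma=1.5` the probability of capture at both occasions exceeds the product of the two marginal capture probabilities.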

As with other GLMMs, integrating the random effect from the probability mass function
of $(\mathbf{y}_i \mid u_i)$ yields the likelihood function. We can consider this likelihood function and the
resulting ML estimates of $\{\beta_t\}$ and $\sigma$ for all possible counts in the unobserved cell. A
Table 13.6 Capture-Recapture Results for Snowshoe Hares

                                   Capture 3, Capture 2, Capture 1
Capture 6  Capture 5  Capture 4    000      001    010    011    100    101    110    111
    0          0          0         -        3      6      0      5      1      0      0
                                  (24.0)a  (2.3)  (5.4)  (0.9)  (3.2)  (0.5)  (1.2)  (0.3)
    0          0          1         3        2      3      0      0      1      0      0
                                  (4.8)    (0.8)  (1.8)  (0.5)  (1.1)  (0.3)  (0.6)  (0.3)
    0          1          0         4        2      3      1      0      1      0      0
                                  (3.9)    (0.6)  (1.5)  (0.4)  (0.9)  (0.2)  (0.5)  (0.2)
    0          1          1         1        0      0      0      0      0      0      0
                                  (1.3)    (0.3)  (0.8)  (0.3)  (0.5)  (0.2)  (0.4)  (0.3)
    1          0          0         4        1      1      1      2      0      2      0
                                  (6.8)    (1.1)  (2.6)  (0.6)  (1.5)  (0.4)  (0.9)  (0.4)
    1          0          1         4        0      3      0      1      0      2      0
                                  (2.3)    (0.6)  (1.3)  (0.5)  (0.8)  (0.3)  (0.7)  (0.4)
    1          1          0         2        0      1      0      1      0      1      0
                                  (1.9)    (0.5)  (1.1)  (0.4)  (0.7)  (0.3)  (0.6)  (0.4)
    1          1          1         1        1      1      0      0      0      1      2
                                  (1.0)    (0.4)  (0.9)  (0.5)  (0.5)  (0.3)  (0.7)  (0.7)

a Fitted values for logistic-normal model; 1 = capture, 0 = noncapture.
Source: Coull and Agresti (1999).



profile likelihood function views the maximized likelihood as a function of the unobserved
cell count. The ML prediction for that unobserved cell count is the value that maximizes
this profile likelihood. Lacking specialized software, we can fit the GLMM repeatedly
with various counts in the unobserved cell to determine by trial and error the count that
maximizes the likelihood function. ML fitting to Table 13.6 yields a prediction of 24.0 for
the unobserved cell count. Since the study observed 68 hares, the population size estimate
is $\hat{N} = 92$. For this fit, $\hat{\sigma} = 1.0$.

Methods for obtaining a confidence interval for $N$ include using the profile likelihood
function or a nonparametric bootstrap method. With the profile likelihood approach, the
interval for the missing cell count consists of the possible counts for that cell such that
the $G^2$ fit statistic increases by less than $\chi^2_1(\alpha)$ from its value at the ML estimate. Adding
the number of subjects observed in the samples to the endpoints of this interval gives the
corresponding interval for $N$. For the snowshoe hares, a 95% profile likelihood confidence
interval for $N$ is (75, 154). It is common for $\hat{N}$ to be nearer the low end of the interval. See
Coull and Agresti (1999) for details.
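The interval construction just described can be sketched as a small search over candidate counts. In the sketch below the function name is hypothetical and the $G^2$ values are invented toy numbers, not results from fitting the snowshoe hare data; in practice each value would come from refitting the GLMM with that count in the unobserved cell. The sketch also assumes that the qualifying counts form one contiguous range.

```python
def profile_interval(g2_by_count, n_observed, critical=3.841):
    """Profile likelihood interval for N from G^2 values at candidate counts.

    g2_by_count: maps a candidate count for the unobserved cell to the G^2
                 statistic of the model fitted with that count
    n_observed:  number of subjects observed in the samples
    critical:    chi-squared (df = 1) critical value; 3.841 gives 95%
    """
    g2_min = min(g2_by_count.values())
    inside = [m for m, g2 in g2_by_count.items() if g2 - g2_min < critical]
    return n_observed + min(inside), n_observed + max(inside)

# Toy G^2 values, for illustration only
toy_g2 = {5: 9.0, 10: 5.5, 20: 3.2, 24: 3.0, 40: 4.1, 80: 6.5, 120: 7.5}
interval = profile_interval(toy_g2, n_observed=68)
print(interval)
```

A coarse grid such as this one only brackets the endpoints; a finer grid near each endpoint would be needed for a reportable interval.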

The greater the heterogeneity, as reflected by larger $\hat{\sigma}$, the larger $\hat{N}$ tends to be and the
wider the confidence interval tends to be. Large $\sigma$ causes difficulties in estimation, since it
results in a relatively flat likelihood surface. This implies imprecise estimates of $N$. In
particular, the upper limit of the profile likelihood confidence interval for $N$ is essentially
infinite when the likelihood function gets sufficiently flat. Also, the ML estimator is then
often unstable, with small changes in the data yielding large changes in $\hat{N}$. Difficulties
can also arise when probabilities of capture are small. Evidence of this occurs when most
subjects captured appear in only one sample. When this happens or when $\sigma$ is large,
confidence intervals for $N$ are necessarily very wide.

Alternative models are discussed in Section 14.1.4. Models that ignore likely heterogeneity
can give unrealistically narrow confidence intervals for $N$. Although traditionally
used for animal populations, capture-recapture applications also include estimating
population size for human populations. Darroch et al. (1993) considered census population
estimation, and Chao et al. (2001) estimated the number of people infected during a hepatitis
outbreak (Exercise 13.15). An interesting application is estimating the number of files
on the World Wide Web relating to some subject by taking samples using several search
engines (Fienberg et al. 1999).

13.3.5  Example:  Heterogeneity  Among  Multicenter  Clinical  Trials 

Many applications compare two groups on a categorical response for data stratified on a
third variable. With binary outcomes, the data form several $2 \times 2$ contingency tables. The
main focus relates to studying the association in the $2 \times 2$ tables and whether and how it
varies among the strata.

The  strata  are  sometimes  themselves  a  sample,  such  as  schools  or  medical  clinics. 
A  random  effects  approach  is  then  natural.  With  a  random  sampling  of  strata,  it  enables 
inferences  to  extend  to  the  population  of  strata.  The  fit  of  the  random  effects  model  provides 
a  simple  summary  such  as  an  estimated  mean  and  standard  deviation  of  log  odds  ratios 
for  the  population  of  strata.  In  each  stratum  it  also  provides  a  predicted  log  odds  ratio  that 
shrinks  the  sample  value  toward  the  mean.  This  is  especially  useful  when  the  sample  size 
in  a  stratum  is  small  and  the  sample  log  odds  ratio  has  large  standard  error.  Even  when  the 
strata  are  not  a  random  sample  or  not  even  a  sample  and  a  random  effects  approach  is  not 
as  natural,  the  model  is  beneficial  for  these  purposes. 

We illustrate using Table 13.7, previously analyzed in Section 6.4, showing the results
of a clinical trial at eight centers. The purpose was to compare an active drug and a control,
for curing a fungal infection. For a subject in center $i$ using treatment $t$ (1 = active drug;


Table 13.7 Clinical Trial Relating Treatment to Response for Eight Centers

                       Response          Sample       GLMM Fitted
Center  Treatment   Success  Failure   Odds Ratio    Odds Ratio
  1     Drug          11       25        1.19          2.02
        Control       10       27
  2     Drug          16        4        1.82          2.09
        Control       22       10
  3     Drug          14        5        4.80          2.19
        Control        7       12
  4     Drug           2       14        2.29          2.11
        Control        1       16
  5     Drug           6       11        ∞             2.18
        Control        0       12
  6     Drug           1       10        ∞             2.12
        Control        0       10
  7     Drug           1        4        2.00          2.11
        Control        1        8
  8     Drug           4        2        0.33          2.06
        Control        6        1

Source: Beitler and Landis (1985).
2 = control), let $y_{it} = 1$ denote success. One possible model is the logistic-normal,

$$\mathrm{logit}[P(Y_{i1} = 1 \mid u_i)] = \alpha + \beta/2 + u_i,$$
$$\mathrm{logit}[P(Y_{i2} = 1 \mid u_i)] = \alpha - \beta/2 + u_i, \qquad (13.13)$$

where $\{u_i\}$ are independent $N(0, \sigma^2)$ variates. This model assumes that the log odds ratio
$\beta$ between treatment and response is constant over centers. The parameter $\sigma$ summarizes
center heterogeneity in the success probabilities.

A logistic-normal model permitting treatment-by-center interaction is

$$\mathrm{logit}[P(Y_{i1} = 1 \mid u_i, b_i)] = \alpha + (\beta + b_i)/2 + u_i,$$
$$\mathrm{logit}[P(Y_{i2} = 1 \mid u_i, b_i)] = \alpha - (\beta + b_i)/2 + u_i, \qquad (13.14)$$

where $\{(u_i, b_i)\}$ are independent bivariate normal, with variances $\sigma_u^2$ and $\sigma_b^2$ and correlation
$\rho$. The log odds ratio equals $\beta + b_i$ in center $i$. These vary among centers according to
a $N(\beta, \sigma_b^2)$ distribution. That is, $\beta$ is the expected center-specific log odds ratio between
treatment and response, and $\sigma_b$ describes variability in those log odds ratios. The model
parameters are $(\alpha, \beta, \sigma_u, \sigma_b, \rho)$.

In Table 13.7 the sample success rates vary markedly among centers both for the control
and drug treatments, but in all except the last center that rate is higher for the drug treatment.
In using models with random center and random treatment effects, it is preferable to have
many more than eight centers. It is difficult to get reliable variance component estimates
with so few centers, and the asymptotics for estimating the parameters for this type of
model apply as the number of centers increases. Keeping this in mind, we use these
data to illustrate the models. Simplifying the model a bit by taking $\rho = 0$ (justified by
an ML estimate that is not significantly nonzero), the treatment estimates are $\hat{\beta} = 0.739$
(SE = 0.300) for the model (13.13) of no interaction and $\hat{\beta} = 0.746$ (SE = 0.325) for the
model (13.14) permitting interaction. Considerable evidence of a drug effect occurs.

The evidence about association is weaker for the model permitting interaction. The Wald
statistics are $(0.739/0.300)^2 = 6.0$ for the no-interaction model and $(0.746/0.325)^2 = 5.3$
for the interaction model (df = 1). The corresponding likelihood-ratio statistics are 6.3 and
4.6. The extra variance component in the interaction model pertains to variability in the log
odds ratios. As its estimate $\hat{\sigma}_b$ increases, the SE of the estimated treatment effect $\hat{\beta}$
tends to increase as well. In this example, $\hat{\sigma}_b = 0.15$ is relatively small and the standard errors of $\hat{\beta}$
are not very different in the two models. When $\hat{\sigma}_b = 0$, the standard errors and the model
fits are the same.

To show the effect of larger $\hat{\sigma}_b$ on the SE of the mean treatment effect estimate $\hat{\beta}$, we
alter Table 13.7 slightly. We change three failures to successes for drug in center 3 and three
successes to failures for drug in center 8. With these changes, the estimated variability of
the treatment effects increases from $\hat{\sigma}_b = 0.15$ to $\hat{\sigma}_b = 1.37$. The ML estimates of the mean
treatment effects are then $\hat{\beta} = 0.722$ (SE = 0.299) for the no-interaction model (13.13) and
$\hat{\beta} = 0.767$ (SE = 0.623) for the interaction model. The Wald statistics are 5.84 and 1.52.
The evidence of a treatment effect is then dramatically weaker for the interaction model.
Not surprisingly, when the treatment effect varies substantially among centers, it is more
difficult to estimate the mean of that effect.
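The Wald comparisons in this example follow from the reported estimates and standard errors. The sketch below, with a hypothetical helper name, recomputes them and attaches df = 1 chi-squared P-values. Because the inputs are the rounded values printed in the text, the statistics can differ slightly in the last digit from those quoted.

```python
import math

def wald_test(estimate, se):
    """Wald chi-squared statistic (df = 1) and its P-value."""
    statistic = (estimate / se) ** 2
    # Upper tail of chi-squared with df = 1, via the standard normal tail
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# (estimate, SE) pairs as reported in the text
fits = {
    "actual data, no interaction (13.13)": (0.739, 0.300),
    "actual data, interaction (13.14)": (0.746, 0.325),
    "altered data, no interaction (13.13)": (0.722, 0.299),
    "altered data, interaction (13.14)": (0.767, 0.623),
}

results = {label: wald_test(b, se) for label, (b, se) in fits.items()}
for label, (w, p) in results.items():
    print(f"{label}: Wald = {w:.2f}, P = {p:.3f}")
```

For the altered data, the P-value for the interaction model is far larger than for the no-interaction model, which is the numerical counterpart of the "dramatically weaker" evidence noted above.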

For the actual data in Table 13.7, because $\hat{\sigma}_b = 0.15$ for model (13.14) is relatively
small, the model shrinks the sample odds ratios considerably. Table 13.7 shows the sample
values and the model predicted values. These are based on predicting the random effects
and substituting them and the ML estimates of fixed effects into the model formula to
estimate the two response probabilities for each treatment in each center. The sample odds
ratios vary from 0.33 to $\infty$; their GLMM counterparts vary only between 2.02 and 2.19.
The smoothed estimates are much less variable and do not have the same ordering as the
sample values. For instance, the smoothed estimate of 2.19 for center 3 is greater than the
estimate of 2.12 for center 6, even though the sample value is infinite for the latter. This
reflects the greater shrinkage that occurs when sample sizes are smaller.
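The sample odds ratios in Table 13.7 can be recomputed from the counts, which also shows where the infinite values come from. This is a minimal sketch; the variable and function names are hypothetical, and the GLMM fitted values are not reproduced because they require fitting the model.

```python
import math

# (success, failure) for drug and for control, by center, from Table 13.7
centers = {
    1: ((11, 25), (10, 27)),
    2: ((16, 4), (22, 10)),
    3: ((14, 5), (7, 12)),
    4: ((2, 14), (1, 16)),
    5: ((6, 11), (0, 12)),
    6: ((1, 10), (0, 10)),
    7: ((1, 4), (1, 8)),
    8: ((4, 2), (6, 1)),
}

def sample_odds_ratio(drug, control):
    """Odds of success for drug divided by odds of success for control."""
    s_drug, f_drug = drug
    s_ctrl, f_ctrl = control
    denominator = f_drug * s_ctrl
    if denominator == 0:
        # A zero count in the denominator gives an infinite sample odds ratio
        return math.inf
    return (s_drug * f_ctrl) / denominator

odds_ratios = {c: sample_odds_ratio(d, k) for c, (d, k) in centers.items()}
for center, value in odds_ratios.items():
    print(center, round(value, 2))
```

Centers 5 and 6 have no successes for control, so their sample odds ratios are infinite regardless of how few subjects they contain.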

13.3.6  Meta-analysis  Using  a  Random  Effects  Approach 

In Section 6.4.6 we discussed ways of summarizing multiple $2 \times 2$ tables, such as arise in
meta-analyses to compare two treatments on a binary response. The analyses just shown
for multicenter clinical trials also apply naturally to meta-analyses. The models in Section
6.4.6 did not allow for heterogeneity in the effects or allow for treating the studies as a
random sample of potential studies. With random effects models, we have a natural way to
do both of these.

The  model  (13.14)  provides  a  summary  of  variability  in  log  odds  ratios  among  studies. 
For  simpler  interpretation,  we  might  want  to  describe  variability  among  relative  risk  or 
difference  of  proportion  values.  This  can  be  done  with  analogous  models  using  log  or 
identity links. However, there are structural problems, as the linear predictor can take
any possible real number value when it contains a normal random effect, whereas these
links require it to stay in a restricted range to give valid probabilities. DerSimonian
and  Laird  (1986)  proposed  a  random  effects  approach  with  the  difference  of  proportions, 
treating  the  difference  of  proportions  as  coming  from  a  normal  distribution.  Their  method 
uses  weighted  least-squares  estimators  with  weights  based  on  sample  estimates  of  variances 
of  proportions.  This  approach,  compared  with  ML,  can  behave  poorly  for  small  samples 
with  true  proportions  near  the  boundary,  and  it  can  be  quite  biased  when  the  weights 
are  correlated  with  the  study  effect  sizes  of  interest.  Warn  et  al.  (2002)  used  a  Bayesian 
approach,  discussed  in  Section  13.7.2,  that  imposes  constraints  that  reflect  the  bounds  for 
probabilities. 

A  challenging  situation  for  meta-analyses  is  when  the  outcome  of  interest  has  very  low 
probability.  Some  tables  may  have  empty  cells  for  one  or  both  treatments.  As  discussed 
in  Section  6.4.6,  such  cases  do  not  affect  statistical  significance  but  do  provide  evidence 
about  the  magnitude  of  the  difference  of  proportions  and  its  variability.  See  Emerson  et  al. 
(1993)  and  references  in  the  Laird  et  al.  comments  about  Shuster  (2010). 


13.3.7  Alternative  Formulations  of  Random  Effects  Models 

There are other ways to express random effects models. For instance, an equivalent
expression for interaction model (13.14) is

$$\mathrm{logit}[P(Y_{it} = 1 \mid u_i, b_{it})] = \alpha + \beta x_t + b_{it} + u_i,$$

where $x_t$ is a treatment indicator variable ($x_1 = 1$, $x_2 = 0$). Here, $b_{i1} - b_{i2}$ corresponds to
$b_i$ in parameterization (13.14), and $2\,\mathrm{var}(b_{it})$ here corresponds to $\sigma_b^2$ in (13.14).

Formulating a random effects model requires care about implications of the model
expression and the random effects correlation structure. Suppose we expressed this interaction
model as

$$\mathrm{logit}[P(Y_{it} = 1 \mid u_i, b_i)] = \alpha + (\beta + b_i)x_t + u_i, \qquad (13.15)$$

with $\{b_i\}$ from $N(0, \sigma_b^2)$ independent of $\{u_i\}$ from $N(0, \sigma_u^2)$. This is inappropriate, because
the model then imposes greater variability for the logit with the first treatment than the
second, since $x_2 = 0$ and $\{u_i\}$ and $\{b_i\}$ are uncorrelated. Also, the model should not depend
on the definition of the indicator variable $x_t$. Note, however, that if $z_t = x_t + c$ for some
constant $c$, then model (13.15) is equivalently

$$\mathrm{logit}[P(Y_{it} = 1 \mid u_i, b_i)] = \alpha + (\beta + b_i)(z_t - c) + u_i = \alpha' + (\beta + b_i)z_t + v_i,$$

where $\alpha' = \alpha - c\beta$ and $v_i = u_i - cb_i$. Thus, $(v_i, b_i)$ are correlated even if $(u_i, b_i)$ are not. In
fact, expression (13.15) is sensible only with correlated random effects. It is then equivalent
to (13.14) with correlated random effects.
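The variance argument can be written out explicitly. The following short derivation is a sketch under the stated assumption that $u_i$ and $b_i$ in (13.15) are uncorrelated with variances $\sigma_u^2$ and $\sigma_b^2$; it uses only the definitions given above.

```latex
\begin{align*}
\mathrm{var}\{\alpha + (\beta + b_i)x_t + u_i\}
  &= \sigma_u^2 + x_t^2\,\sigma_b^2
   = \begin{cases}
       \sigma_u^2 + \sigma_b^2, & t = 1 \ (x_1 = 1),\\
       \sigma_u^2,              & t = 2 \ (x_2 = 0),
     \end{cases} \\[4pt]
\mathrm{cov}(v_i, b_i)
  &= \mathrm{cov}(u_i - c\,b_i,\; b_i) = -c\,\sigma_b^2, \\
\mathrm{var}(v_i)
  &= \sigma_u^2 + c^2\sigma_b^2 .
\end{align*}
```

So the logit is more variable for the first treatment purely because of the coding $x_1 = 1$, $x_2 = 0$, and recoding the indicator with $c \neq 0$ introduces a nonzero covariance between the intercept and slope random effects.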

Rabe-Hesketh and Skrondal (2001) showed that careful attention must be paid to parameter
identification in models with multivariate random effects. Their factor model contains
many multivariate random effects models as special cases.

13.3.8  Example:  Matched  Pairs  with  a  Bivariate  Binary  Response 

A  sample  of  schoolboys  were  interviewed  twice,  several  months  apart,  and  asked  about 
their  self-perceived  membership  in  the  “leading  crowd”  and  about  whether  they  sometimes 
needed  to  go  against  their  principles  to  belong  to  that  group.  Thus,  there  are  two  binary 
response  variables,  which  we  refer  to  as  membership  and  attitude,  measured  at  two  interview 
times  for  each  subject.  Table  13.8  labels  the  categories  for  attitude  as  (positive,  negative), 
where  “positive”  refers  to  disagreeing  with  the  statement  that  one  must  go  against  his 
principles. 

For subject $i$, let $y_{itv}$ be the response at interview time $t$ on variable $v$, where $v = M$ for
membership and $v = A$ for attitude. The logistic model

$$\mathrm{logit}[P(Y_{itv} = 1 \mid u_{iv})] = \alpha + \beta_{tv} + u_{iv} \qquad (13.16)$$

is a multivariate form of the Rasch-type model (13.4). It has additive item and subject effects
for each variable $v$. Here, $(u_{iM}, u_{iA})$ is a bivariate random effect that describes subject
heterogeneity for (membership, attitude). We assume that the $\{(u_{iM}, u_{iA})\}$ are independent


Table 13.8 Membership and Attitude Toward the “Leading Crowd”

(M, A) for                      (M, A) for Second Interview a
First Interview   (Yes, Positive)  (Yes, Negative)  (No, Positive)  (No, Negative)
Yes, positive          458              140              110              49
Yes, negative          171              182               56              87
No, positive           184               75              531             281
No, negative            85               97              338             554

a M, membership; A, attitude.

Source: J. S. Coleman, Introduction to Mathematical Sociology. London: Free Press of Glencoe, 1964, p. 170.
from a bivariate normal distribution, $N(\mathbf{0}, \boldsymbol{\Sigma})$, with possibly different variances and nonzero
correlation.

The ML fit yields $\hat{\beta}_{2M} - \hat{\beta}_{1M} = 0.379$ (SE = 0.075) and $\hat{\beta}_{2A} - \hat{\beta}_{1A} = 0.176$ (SE =
0.058). For both variables, the probability of the first outcome category is higher at the
second interview. For instance, for a given subject the odds of self-perceived membership
in the leading crowd at interview 2 are estimated to be exp(0.379) = 1.46 times the odds
at interview 1.

The estimated correlation between the random effects is 0.32. Their estimated standard
deviations are $\hat{\sigma}_1 = 3.08$ for $\{u_{iM}\}$ and $\hat{\sigma}_2 = 1.49$ for $\{u_{iA}\}$. Since these are quite
different, the relative sizes of membership and attitude effects differ for marginal and
random effects models (recall the caveat in Section 13.2.4). The marginal effect is attenuated
more for membership. For this random effects model, the ratio of estimated odds
ratios is exp(0.379)/exp(0.176) = 1.46/1.19 = 1.22. For the marginal model, the estimated
odds ratios use the marginal distributions of each variable at each time [e.g., this
is (1392/2006)/(1253/2145) = 1.188 for membership], and the ratio of estimated odds
ratios is 1.188/1.133 = 1.05.
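The marginal odds ratios quoted here come straight from the row and column totals of Table 13.8. The sketch below recomputes them; the function name and the index convention for categories are assumptions introduced for the illustration.

```python
# Rows: first interview; columns: second interview.
# Category order: (Yes, Positive), (Yes, Negative), (No, Positive), (No, Negative)
counts = [
    [458, 140, 110, 49],
    [171, 182, 56, 87],
    [184, 75, 531, 281],
    [85, 97, 338, 554],
]

def marginal_odds_ratio(table, first_outcome):
    """Marginal odds of the first outcome category at interview 2 vs. interview 1.

    first_outcome: indices of the (M, A) categories in which the variable of
    interest takes its first outcome (e.g., membership = yes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    at_interview_1 = sum(row_totals[k] for k in first_outcome)
    at_interview_2 = sum(col_totals[k] for k in first_outcome)
    odds_1 = at_interview_1 / (n - at_interview_1)
    odds_2 = at_interview_2 / (n - at_interview_2)
    return odds_2 / odds_1

membership_or = marginal_odds_ratio(counts, first_outcome=(0, 1))  # M = yes
attitude_or = marginal_odds_ratio(counts, first_outcome=(0, 2))    # A = positive
ratio_of_ors = membership_or / attitude_or
print(round(membership_or, 3), round(attitude_or, 3), round(ratio_of_ors, 2))
```

The ratio of marginal odds ratios is much closer to 1 than the corresponding ratio of 1.22 from the random effects model, reflecting the stronger attenuation for membership.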

Integrating over the estimated random effects distribution yields fitted values for the 16
possible sequences of responses in Table 13.8. The deviance of $G^2 = 5.5$ (df = 8) compares
the 16 observed counts to their fitted values. The model, which describes 15 multinomial
probabilities with seven parameters, fits well. The model constraining the random effects
to be uncorrelated fits poorly ($G^2 = 97.5$, df = 9). The model constraining the random
effects to be perfectly correlated is equivalent to having a single random effect $u_i$ for each
subject. The model is then a Rasch-type model with four items that are the combinations
of interviews and variables. That model fits very poorly ($G^2 = 655.5$, df = 10).

13.3.9  Time  Series  Models  Using  Autocorrelated  Random  Effects 

Section 12.4 noted that categorical time series data can be modeled with transitional
models in which previous response values as well as ordinary explanatory variables serve
as predictors in the model for $Y_t$. When a main purpose is to describe the effect of an
explanatory variable $x_t$ on $E(Y_t)$, a disadvantage of such models is that the interpretation of
the $\beta$ coefficient of $x_t$ depends on how many previous response values are in the model. For
a first-order Markov logistic model, for instance, $\beta$ refers to the impact on $\mathrm{logit}[P(Y_t = 1)]$
of a 1-unit increase in $x_t$, but at a fixed value of $y_{t-1}$.

In an alternative approach, Klingenberg (2008) proposed GLMMs in which the serial
dependence is accounted for by random effects having an autoregressive structure. For
binary data, generalizing the logistic-normal model (13.6), he assumed that

$$\mathrm{logit}[P(Y_t = 1 \mid u_t)] = \mathbf{x}_t^T\boldsymbol{\beta} + u_t, \qquad (13.17)$$

where

$$u_t = \rho u_{t-1} + \epsilon_t, \quad t = 2, 3, \ldots, T,$$

and $\{\epsilon_t\}$ are uncorrelated normal variates. He parameterized by setting $u_1 \sim N(0, \sigma^2)$ and
taking $\epsilon_t \sim N(0, \sigma^2[1 - \rho^2])$, so that $\mathrm{var}(u_t) = \sigma^2$ for all $t$. More generally, there can be a
varying time lag $d_t$ between $y_{t-1}$ and $y_t$, in which case $\rho$ is replaced by $\rho^{d_t}$.
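The claim that this parameterization keeps $\mathrm{var}(u_t) = \sigma^2$ for every $t$ can be checked by iterating the variance recursion implied by $u_t = \rho u_{t-1} + \epsilon_t$. The sketch below is a numerical check with hypothetical function names; the values $\sigma = 2.03$ and $\rho = 0.69$ are used only as convenient inputs.

```python
def ar1_variances(sigma, rho, T):
    """var(u_t) for t = 1, ..., T under u_1 ~ N(0, sigma^2) and
    innovations with variance sigma^2 * (1 - rho^2)."""
    innovation_var = sigma ** 2 * (1.0 - rho ** 2)
    variances = [sigma ** 2]
    for _ in range(2, T + 1):
        # var(rho * u_{t-1} + eps_t) = rho^2 var(u_{t-1}) + var(eps_t),
        # because eps_t is uncorrelated with u_{t-1}
        variances.append(rho ** 2 * variances[-1] + innovation_var)
    return variances

def lagged_correlation(rho, total_lag):
    """corr(u_t, u_{t-k}) for the stationary process, with total lag k."""
    return rho ** total_lag

variances = ar1_variances(sigma=2.03, rho=0.69, T=10)
print(variances[0], variances[-1])
print(lagged_correlation(0.69, 2))
```

The same `lagged_correlation` function covers the unequally spaced case: with a time lag of $d_t$ units the correlation between adjacent random effects is $\rho^{d_t}$.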
In model (13.17), the effect of an explanatory variable is conditional on the random effect
but not on previous response values. ML model fitting is complex, because the random effect
to be integrated out to obtain the likelihood function is $T$-dimensional. Klingenberg (2008)
presented a Monte Carlo EM algorithm for doing this.

13.3.10  Example:  Oxford  and  Cambridge  Annual  Boat  Race 

Klingenberg illustrated the time series model using the outcomes of 152 races between
the rowing teams from Cambridge and Oxford, for the famous race held nearly every year
since 1829. Let $y_t = 1$ when Cambridge wins and $y_t = 0$ when Oxford wins. Figure 13.2
shows the data, which are available at www.theboatrace.org.

A $2 \times 2$ table cross-classifying $y_t$ by $y_{t-1}$ for the 151 adjacent pairs of races has counts
48 for (1,1), 43 for (0,0), and 25 for both (1,0) and (0,1), giving an odds ratio of 3.30
between successive responses. Klingenberg focused on whether the weight differential
between the crews has an effect. Let $x_t$ be the average weight difference, in pounds per
crewman, between the Cambridge team and the Oxford team in year $t$. The model

$$\mathrm{logit}[P(Y_t = 1 \mid u_t)] = \alpha + \beta x_t + u_t$$

with autoregressive normal random effect has $\hat{\alpha} = 0.25$ (SE = 0.44), $\hat{\beta} = 0.14$ (SE =
0.06), $\hat{\sigma} = 2.03$ (SE = 0.81), and $\hat{\rho} = 0.69$ (SE = 0.12). Weight seems to have a positive
effect, but the estimated size of that effect is rather imprecise. The substantial $\hat{\rho}$ value
reflects the strong association between successive outcomes.
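Two of the numbers in this example are easy to reproduce from the reported summaries: the odds ratio between successive outcomes and the multiplicative effect of weight implied by the estimated coefficient of 0.14. The sketch below uses hypothetical names and does not fit the autoregressive model itself.

```python
import math

def odds_ratio(n11, n10, n01, n00):
    """Odds ratio for a 2 x 2 table of (y_t, y_{t-1}) counts."""
    return (n11 * n00) / (n10 * n01)

# Counts reported in the text for the cross-classification of successive races
lag1_odds_ratio = odds_ratio(n11=48, n10=25, n01=25, n00=43)

# Estimated multiplicative effect on the conditional odds of a Cambridge win
# per 1-pound increase in average weight difference, given the random effect
odds_multiplier = math.exp(0.14)

print(round(lag1_odds_ratio, 2), round(odds_multiplier, 2))
```

The second quantity is a conditional (subject-to-random-effect) interpretation, so it should not be read as the effect on the marginal odds.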

For his model fit, Figure 13.2 also shows the estimated conditional and marginal
probabilities of a Cambridge win. The influence of the predicted random effect is to move many of
the estimated conditional probabilities well away from 0.50, compared with the estimated
marginal probabilities. Klingenberg checked the fit of the model in various ways. One way
used the model fit to estimate three-dimensional transition probabilities for sequences of
three races in a row, and compared these to the observed proportions for those sequences.
The fit seemed to be quite good.

Figure 13.2 Results of annual rowing race (1 = Cambridge wins, 0 = Oxford wins) in first panel, with estimated
conditional (second panel) and marginal (third panel) probabilities of a Cambridge win. Source: Klingenberg
(2008). Used with permission.


13.4  RANDOM  EFFECTS  MODELS  FOR  MULTINOMIAL  DATA 

Random  effects  models  for  binary  responses  extend  to  multicategory  responses.  For  the 
multicategory  models  of  Chapter  8,  adding  random  effects  extends  this  multivariate  GLM 
to  a  multivariate  GLMM  (Hartzel  et  al.  2001b).  This  class  includes  models  for  nominal  and 
ordinal  responses. 

13.4.1  Cumulative  Logit  Model  with  Random  Intercept 

Modeling is simpler with ordinal than nominal responses, since often the same random
effect and the same fixed effect can apply to each logit. With cumulative logits, this is
the proportional odds structure (Section 8.2.2). Denote the possible outcomes for $y_{it}$,
observation $t$ in cluster $i$, by $1, 2, \ldots, c$. A GLMM for the cumulative logits has the form

$$\mathrm{logit}[P(Y_{it} \le j \mid \mathbf{u}_i)] = \alpha_j + \mathbf{x}_{it}^T\boldsymbol{\beta} + \mathbf{z}_{it}^T\mathbf{u}_i, \quad j = 1, \ldots, c - 1. \qquad (13.18)$$

Hedeker and Gibbons (1994) and Tutz and Hennevogl (1996) discussed model fitting,
primarily by treating $\mathbf{u}_i$ as multivariate normal.

For cumulative logit and probit random intercept models, the same relationship exists
between their effects and those in marginal models as presented in Section 13.2.3 for
binary-response models. Marginal effects tend to be smaller, increasingly so as $\sigma$ increases.
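To make the cumulative logit structure concrete, the sketch below converts cutpoints and a linear predictor (fixed part plus random effect) into category probabilities. The function name and the numerical values are hypothetical; the only modeling assumption is the form logit[P(Y <= j | u)] = alpha_j + eta, with the same eta for every j.

```python
import math

def category_probabilities(alphas, eta):
    """Category probabilities P(Y = j), j = 1, ..., c, from cumulative logits.

    alphas: strictly increasing cutpoints alpha_1 < ... < alpha_{c-1}
    eta:    linear predictor shared by all cumulative logits
            (fixed effects plus random effect)
    """
    if any(b <= a for a, b in zip(alphas, alphas[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cumulative = [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas]
    cumulative.append(1.0)                       # P(Y <= c) = 1
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

# Hypothetical cutpoints for c = 3 categories, at two values of eta
low = category_probabilities(alphas=[-1.0, 1.0], eta=0.0)
high = category_probabilities(alphas=[-1.0, 1.0], eta=1.5)
print(low)
print(high)
```

Increasing `eta` shifts probability toward the low end of the scale, which is how a larger random intercept raises every cumulative probability at once.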

13.4.2  Example:  Insomnia  Study  Revisited 

Table 12.3 showed results of a clinical trial at two occasions comparing a drug with placebo
in treating insomnia patients. In Sections 12.1.3 and 12.2.3 we analyzed the data with
marginal models. For $y_t$ = time to fall asleep at occasion $t$, the marginal model

$$\mathrm{logit}[P(Y_t \le j)] = \alpha_j + \beta_1 t + \beta_2 x + \beta_3 (t \times x)$$

permits interaction between $t$ = occasion (0 = initial, 1 = follow-up) and $x$ = treatment
(1 = active, 0 = placebo). Table 13.9 shows the ML and GEE estimates.


Table 13.9 Fits of Cumulative Logit Models to Insomnia Data in Table 12.3 a

                         Marginal         Marginal         Random Effects
Effect                   ML               GEE              (GLMM) ML
Treatment                0.046 (0.236)    0.034 (0.238)    0.058 (0.366)
Occasion                 1.074 (0.162)    1.038 (0.168)    1.602 (0.283)
Treatment x occasion     0.662 (0.244)    0.708 (0.244)    1.081 (0.380)

a Values in parentheses are standard errors.
Now, let $y_{it}$ denote the response for subject $i$ at occasion $t$. Table 13.9 also shows results
of fitting the random intercept model

$$\mathrm{logit}[P(Y_{it} \le j \mid u_i)] = \alpha_j + \beta_1 t + \beta_2 x + \beta_3 (t \times x) + u_i.$$

Results are substantively similar to the marginal model, but estimates and standard errors
are about 50% larger. This reflects the relatively large heterogeneity ($\hat{\sigma} = 1.90$) and the
resultant strong association between the responses at the two occasions.
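The "about 50% larger" figure can be compared with a well-known approximation for logistic-normal models, usually attributed to Zeger, Liang, and Albert (1988): the conditional effect is roughly the marginal effect multiplied by $\sqrt{1 + c^2\sigma^2}$ with $c = 16\sqrt{3}/(15\pi)$. That approximation is not stated in this section, and applying it to the cumulative logit model is an assumption made here for illustration. The sketch, with hypothetical names, sets it beside the ratios of the Table 13.9 estimates.

```python
import math

def inflation_factor(sigma):
    """Approximate ratio of conditional to marginal effects for a
    logistic-normal model (assumed approximation, see lead-in)."""
    c = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)
    return math.sqrt(1.0 + (c * sigma) ** 2)

factor = inflation_factor(1.90)

# (GLMM ML estimate, marginal ML estimate) from Table 13.9
estimates = {
    "treatment": (0.058, 0.046),
    "occasion": (1.602, 1.074),
    "treatment x occasion": (1.081, 0.662),
}
observed_ratios = {name: glmm / marg for name, (glmm, marg) in estimates.items()}

print(round(factor, 2))
for name, ratio in observed_ratios.items():
    print(name, round(ratio, 2))
```

The observed ratios scatter around the approximate factor rather than matching it exactly, which is expected because the estimates are themselves subject to sampling error.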

13.4.3  Example:  Combining  Measures  on  Ordinal  Items 

In many surveys, subjects respond to a set of items that measure various aspects of some
characteristic, each using the same ordinal scale. For example, some quality of life
instruments have separate questions pertaining to the frequency with which a person participates
in various activities, with each activity measured with a scale such as (never, rarely,
occasionally, often). In such cases with multiple response items of a similar nature, it can be
useful to combine the response outcomes into a single score for each subject that provides
a summary measure of that characteristic.

One approach, commonly used, is to assign scores such as (1, 2, 3, 4) to the outcome
categories and for each subject find the mean score on the set of items. This has the advantage
of simplicity of calculation and of interpretation, but it is often not obvious how to assign the
scores, such as with the quality of life scale just mentioned. An alternative way of forming
a summary measure mimics item-response modeling for binary data (Section 13.1.4).

To  illustrate,  we  use  some  General  Social  Survey  data.  The  GSS  often  asks  subjects 
their  opinion  about  government  spending  in  various  areas,  such  as  health,  education,  the 
environment,  culture  and  the  arts,  defense,  law  enforcement,  unemployment  benefits,  and 
retirement  benefits.  The  outcome  scale  is  (much  more,  more,  the  same,  less,  much  less). 
We  use  only  three  items  here  and  combine  categories  1  and  2  and  combine  categories  4  and 
5  so  we  can  easily  show  the  data,  which  are  in  Table  13.10. 

For  subject  i  with  item  t  (1  =  education,  2  =  health,  3  =  environment),  we  use  the 
random  intercept  model 


$$\operatorname{logit}[P(Y_{it} \le j \mid u_i)] = \alpha_j + \beta_1 I(t = 1) + \beta_2 I(t = 2) + u_i,$$

where $I(t = 1)$ and $I(t = 2)$ are indicators for the first two items. Here, $u_i$ reflects the
propensity of subject $i$ to be favorable to more government spending, relatively greater
values increasing the chance of response at the low end of the scale corresponding to

Table 13.10 Opinions About Government Spending on the Environment, Education, and
Health (1 = more, 2 = same, 3 = less)

             Environment = 1       Environment = 2       Environment = 3
                 Health                Health                Health
Education      1     2     3         1     2     3         1     2     3
    1        651    45    15       304    59    10        92    24    17
    2         57    10     3        50    35    12        15    14     6
    3          7     1     5         7    10     4         6     3    16

Source: 2006 General Social Survey.
Table 13.11 Cumulative Logit Predicted Random Effect Values for Opinions About
Government Spending Data in Table 13.10

Envir.  Educ.  Health  Predicted u_i      Envir.  Educ.  Health  Predicted u_i
  1       1      1         0.88             2       1      1        -0.37
  1       1      2        -0.31             2       2      2        -1.78
  1       2      1        -0.32             1       2      3        -1.62

higher spending. The ML fit of this model has $\hat{\beta}_1 = 1.888$ (SE = 0.102), $\hat{\beta}_2 = 1.540$
(SE = 0.086), and $\hat{\sigma} = 1.56$ (SE = 0.08), reflecting a tendency for subjects to prefer less
spending on the environment than the other two items. The latent variable structure that
generates this form of model (Section 8.2.3) suggests that differences between pairs of $\{u_i\}$
are location shifts between subjects in the distribution of a latent variable for government
spending, with other shifts (described by $\{\beta_t\}$) according to the item considered.

To  construct  a  summary  measure  for  the  subjects  that  reflects  their  propensities  to  be 
favorable to more government spending, we predict $\{u_i\}$ using their posterior means based
on the ML model fit. To illustrate, Table 13.11 shows these for some of the possible response
sequences.  Note  that  compared  with  the  scores  obtained  by  assigning  fixed  scores  to  the 
categories  and  finding  the  mean  outcome  for  each  subject: 

•  For  a  given  set  of  response  outcomes,  a  subject's  predicted  score  is  the  same  for  any 
permutation  of  the  outcomes  for  the  fixed  scores  approach,  but  not  for  the  cumulative 
logit model or models with other links. For example, note in Table 13.11 the results
for response sequences (1, 1, 2), (1, 2, 1), and (2, 1, 1).

•  The  model-based  predicted  scores  are  not  a  linear  function  of  the  mean  of  assigned 
fixed  scores.  The  spacing  is  governed  by  the  shape  of  the  distribution  for  the  underlying 
latent  variable  model,  for  example,  logistic  for  cumulative  logit  link. 

13.4.4  Example:  Cluster  Sampling 

With  surveys  that  use  cluster  sampling,  standard  methods  based  on  simple  random  sampling 
(e.g.,  for  a  single  multinomial  sample)  require  adjustment.  Ordinary  standard  errors  are  too 
small.  When  the  sampling  scheme  randomly  samples  clusters,  we  can  account  for  the 
clustering  using  cluster  random  effects.  We  illustrate  using  data  from  Brier  (1980),  who 
reported 96 observations taken from 20 neighborhoods (the clusters) on $Y$ = satisfaction
with home and $x$ = satisfaction with neighborhood as a whole. The data are shown at
the text website. Each variable was measured with the ordinal scale (unsatisfied, satisfied,
very satisfied). Brier's analysis adjusted for clustering by reducing the Pearson statistic
for testing independence in the 3 × 3 contingency table relating $X$ and $Y$ from 17.9 to 15.7
(df = 4).

Consider the model for $y_{it}$, observation $t$ in cluster $i$,

$$\operatorname{logit}[P(Y_{it} \le j \mid u_i)] = \alpha_j + x_{it}\beta + u_i, \tag{13.19}$$

with scores (1, 2, 3) for the satisfaction levels of $x_{it}$. With a $N(0, \sigma^2)$ distribution assumed
for $u_i$, the ML effect estimate is $\hat{\beta} = -1.201$ (SE = 0.407), with $\hat{\sigma} = 0.92$. By contrast,
treating the 96 observations as a random sample corresponds to fitting this model with
$\sigma = 0$. It has $\hat{\beta} = -1.226$ (SE = 0.370). A slight reduction in significance results from
adjusting for clustering.
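The "slight reduction in significance" can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming ordinary Wald z statistics and two-sided normal p-values (a choice made here for illustration; the text does not say which test was used). The helper name `wald_summary` is hypothetical.

```python
import math

def wald_summary(estimate, se):
    """Return the Wald z statistic and a two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Estimates and standard errors quoted in the text for model (13.19).
z_cluster, p_cluster = wald_summary(-1.201, 0.407)   # with cluster random effects
z_srs, p_srs = wald_summary(-1.226, 0.370)           # sigma = 0 (simple random sample)

# Ratio implied by Brier's adjustment of the Pearson statistic (17.9 reduced to 15.7).
pearson_ratio = 17.9 / 15.7

print(f"random effects model: z = {z_cluster:.2f}, p = {p_cluster:.4f}")
print(f"ignoring clustering:  z = {z_srs:.2f}, p = {p_srs:.4f}")
print(f"Pearson statistic ratio = {pearson_ratio:.2f}")
```

Both analyses point the same way; accounting for clustering mainly widens the standard error.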

Rao  and  Thomas  (1988)  surveyed  ways  of  adjusting  standard  inferences  to  take  into 
account  complex  sampling  methods  in  the  analysis  and  modeling  of  categorical  data.  The 
usual  chi-squared  test  statistics  no  longer  have  chi-squared  null  distributions,  but  rather, 
weighted  sums  of  chi-squared.  See  also  references  in  Note  3.4. 

13.4.5  Baseline-Category  Logit  Models  with  Random  Effects 

For  nominal  response  variables,  we  can  formulate  a  binary  GLMM  that  pairs  each  category 
with  a  baseline  and  fit  these  models  simultaneously  while  allowing  separate  effects.  This 
requires using a vector of cluster-specific random effects $u_{ij}$, one for each logit. The general
form of the baseline-category logit model with random effects is

$$\log \frac{P(Y_{it} = j)}{P(Y_{it} = c)} = \alpha_j + x_{it}^T \beta_j + z_{it}^T u_{ij}, \quad j = 1, \ldots, c - 1.$$

The fixed effects $\beta_j$ and the random effects $u_{ij}$ depend on $j$, since the baseline category is
arbitrary.

Cluster $i$ has a vector $u_i^T = (u_{i1}^T, \ldots, u_{i,c-1}^T)$ of random effects, treated as independent
multivariate normal variates. We recommend an unspecified covariance matrix $\Sigma$ for $u_i$, to
allow different variances for random effects that apply to different logits. With a common
variance, that variance would not be the same as that for the implied random effect for
a logit for an arbitrary pair of categories, $\log[P(Y_{it} = j)/P(Y_{it} = k)]$. With unspecified
covariance the model is structurally the same regardless of the choice of baseline category.
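A short derivation may make the point about a common variance concrete. It is a sketch under simplifying assumptions introduced here for illustration: scalar random intercepts (so $z_{it} = 1$), baseline category $c$, and the notation $\sigma_j^2 = \operatorname{var}(u_{ij})$, $\sigma_{jk} = \operatorname{cov}(u_{ij}, u_{ik})$.

```latex
\log\frac{P(Y_{it}=j)}{P(Y_{it}=k)}
  = \log\frac{P(Y_{it}=j)}{P(Y_{it}=c)} - \log\frac{P(Y_{it}=k)}{P(Y_{it}=c)}
  = (\alpha_j - \alpha_k) + x_{it}^T(\beta_j - \beta_k) + (u_{ij} - u_{ik}),

\operatorname{var}(u_{ij} - u_{ik}) = \sigma_j^2 + \sigma_k^2 - 2\sigma_{jk}.
```

If, for instance, all $\sigma_j^2 = \sigma^2$ and the random effects were uncorrelated, the implied random effect for the pair $(j, k)$ would have variance $2\sigma^2$ rather than $\sigma^2$, so the equal-variance structure depends on which category is taken as baseline.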

13.4.6  Example:  Effectiveness  of  Housing  Program 

Hedeker (2008) discussed a California study² designed to investigate the effectiveness of a
housing  certificate  regarding  whether  individuals  diagnosed  with  mental  illness  who  were 
homeless  or  at  high  risk  of  becoming  homeless  were  able  to  choose  and  stay  in  independent 
housing  in  their  community.  The  housing  certificates  required  clients  to  pay  30%  of  their 
income  toward  rent.  Eligible  subjects  were  randomly  assigned  to  two  groups,  one  of  which 
received  the  certificates  and  the  other  of  which  was  a  control  group.  Initially  and  after  6, 
12,  and  24  months  the  subjects’  housing  status  was  classified  as  (independent  housing, 
community  housing,  streets/shelters). 

Let  c  indicate  whether  a  subject  was  in  the  certificate  group  (1  =  yes,  0  =  no),  and  let 
$t_1$, $t_2$, and $t_3$ be indicators for contrasting each time with the initial baseline. For subject $i$ at
time $t$, Hedeker proposed the model

$$\log \frac{P(Y_{it} = j)}{P(Y_{it} = 3)} = \alpha_j + \beta_{1j} t_1 + \beta_{2j} t_2 + \beta_{3j} t_3 + \beta_{4j} c + \beta_{5j}(c \times t_1) + \beta_{6j}(c \times t_2) + \beta_{7j}(c \times t_3) + u_{ij},$$

for  j  =  1,2.  Table  13.12  shows  ML  estimates  and  SE  values  for  the  model,  based  on 
treating  missing  observations  as  missing  at  random. 

²The data and a SAS file for these analyses are at tigger.uic.edu/~hedeker/long.html.
Table 13.12 ML Estimates for Baseline-Category Logit Random Effects Model for
Modeling Effectiveness of Program for Independent Housing

                               Independent vs. Street     Community vs. Street
Effect                          Estimate      SE           Estimate      SE
Intercept                        -2.67       0.37           -0.45       0.19
t_1 (6 months)                    2.68       0.42            1.94       0.31
t_2 (12 months)                   4.09       0.56            2.82       0.47
t_3 (24 months)                   4.10       0.47            2.26       0.38
c (certificate indicator)         0.78       0.49            0.52       0.27
c × t_1                           2.00       0.61           -0.14       0.49
c × t_2                           0.55       0.69           -1.92       0.61
c × t_3                           0.30       0.62           -0.95       0.54
σ (random effects)                2.33       0.20            0.87       0.14

Source: With kind permission from Springer, from Tables 6.4 and 6.5 of Hedeker (2008).


From  the  first  logit,  the  log  odds  ratio  comparing  the  two  certificate  groups  in  terms  of 
independent housing vs. street/shelters is 0.78 initially, 0.78 + 2.00 = 2.78 after 6 months,
0.78 + 0.55 = 1.33 after 12 months, and 0.78 + 0.30 = 1.08 after 24 months. The increase
in independent housing from the initial measurement is quite pronounced for the certificate
group (relative to the control group) after 6 months, but is not significant at 12 or 24 months.
The relatively large $\hat{\sigma} = 2.33$ for the random effects for this logit reflects strong positive
associations among the repeated responses, conditional on response in one of these two
categories. 
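The log odds ratios quoted above, and the statement about significance, follow directly from Table 13.12. The sketch below reproduces the arithmetic; it assumes that "significant" refers to a Wald z statistic (estimate divided by SE) for each certificate-by-time interaction compared with 1.96, which is an interpretation adopted here rather than something stated in the text.

```python
import math

# Estimates and SEs copied from Table 13.12 (independent housing vs. street/shelters).
main_effect = 0.78  # c (certificate indicator), the log odds ratio at the initial time
interactions = {
    "6 months": (2.00, 0.61),
    "12 months": (0.55, 0.69),
    "24 months": (0.30, 0.62),
}

log_or = {"initial": main_effect}
wald_z = {}
for label, (estimate, se) in interactions.items():
    log_or[label] = main_effect + estimate   # log odds ratio at that follow-up time
    wald_z[label] = estimate / se            # tests the change from the initial time

for label, value in log_or.items():
    print(f"{label:>9}: log OR = {value:.2f}, OR = {math.exp(value):.1f}")
for label, z in wald_z.items():
    print(f"c x t at {label}: z = {z:.2f}")
```

Only the 6-month interaction has |z| above 1.96, which matches the description in the text.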

We  leave  further  interpretation  of  effects  to  Exercise  13.19.  A  more  complex  model 
permitting a different $\sigma$ for each group for each logit did not fit significantly better.


13.5  MULTILEVEL  MODELING 

In  some  research  studies,  the  data  structure  is  hierarchical,  with  sampled  units  nested  in 
clusters  that  are  themselves  nested  in  other  clusters.  Patients  are  nested  in  hospitals,  which 
are  themselves  nested  in  communities.  Individuals  are  nested  in  families.  In  a  longitudinal 
study,  repeated  measurements  are  nested  within  a  cluster  that  is  a  person  observed  over 
time.  Random  effects  can  enter  models  for  such  data  at  different  levels  of  the  hierarchy. 

Many  early  uses  of  hierarchical  models  were  in  educational  applications  (e.g.,  Aitkin 
et al. 1981). A statewide study of factors that affect student performance might measure each
student’s  scores  on  a  battery  of  exams  but  use  a  model  that  takes  into  account  the  student, 
the  school  or  school  district,  and  the  county.  Just  as  two  observations  on  the  same  student 
will  tend  to  be  more  alike  than  observations  on  different  students,  so  will  two  students  in 
the  same  school  tend  to  be  more  alike  than  two  students  from  different  schools.  Student, 
school,  and  county  terms  can  be  treated  as  random  effects,  with  different  ones  referring  to 
different levels of the model. A model might have students at level 1, schools at level 2, and
counties  at  level  3.  GLMMs  for  data  having  a  hierarchical  structure  of  this  sort  are  called 
multilevel  models. 

When  the  data  have  a  hierarchical  structure,  the  model  should  reflect  that  fact  rather  than 
ignore  it.  The  researcher  can  then  pay  attention  to  explanatory  variables  that  are  relevant 
at  each  level,  and  decompose  the  total  error  variability  into  portions  corresponding  to  each 
level. Using multilevel models also helps to correct for biases that would occur if we ignored
the  clustering  and  the  consequent  within-cluster  correlations. 

13.5.1  Hierarchical  Random  Terms:  Partitioning  Variability 

We illustrate with a two-level model. Let $y_{i(j)t}$ denote the response for student $i$, who attends
school $j$, on test $t$ in a battery of tests, with 1 = pass and 0 = fail. A multilevel model with
random effects $\{v_{i(j)}\}$ for students and $\{u_j\}$ for schools and fixed effects for explanatory
variables has the form

$$\operatorname{logit}[P(Y_{i(j)t} = 1)] = x_{i(j)t}^T \beta + u_j + v_{i(j)}. \tag{13.20}$$

Here, the explanatory variables $x$ might include a factor that identifies the test in the battery
(with categories such as math, verbal, ...). The random effects $u_j$ and $v_{i(j)}$ are assumed to
be independent with distributions $N(0, \sigma_u^2)$ and $N(0, \sigma_v^2)$ having unknown variances.

The level 1 random effects $\{v_{i(j)}\}$ account for variability among students in ability as
well as in characteristics that are not measured in $x$, such as perhaps their achievement
motivation. A relatively large $\sigma_v$ value induces a strong correlation among the test results
for students. The level 2 random effects $\{u_j\}$ account for variability among schools due to
possibly unmeasured variables such as quality of the teachers. Model (13.20) is a random
intercept  model.  More  general  models  also  can  have  random  slopes  for  effects  of  explanatory 
variables. 

As in Section 7.1.1, a latent variable model implies this model. Let $y^*_{i(j)t}$ denote the
latent observation for student $i$ in school $j$ on test $t$, such that we observe $y_{i(j)t} = 1$ if $y^*_{i(j)t}$
falls above some threshold, such as 0. The latent model is

$$y^*_{i(j)t} = x_{i(j)t}^T \beta + u_j + v_{i(j)} + \epsilon_{i(j)t}.$$

The assumption that $\{\epsilon_{i(j)t}\}$ come from a standard logistic distribution, for which the inverse
cdf is the logit link function, implies the logistic random effects model. For it, conditional
on $u_j$ and $v_{i(j)}$, the observed response satisfies the logistic model (13.20). The assumption
that $\{\epsilon_{i(j)t}\}$ come from a standard normal distribution implies a corresponding probit random
effects model.

Although  the  random  effects  enter  at  two  levels,  this  representation  shows  that  such  a 
model  actually  has  three  levels:  A  particular  observation  is  affected  (beyond  the  influence 
of  the  explanatory  variables)  by  random  variability  among  schools,  among  students  within 
the  school,  and  among  tests  taken  by  a  student. 

With independent error terms, the total unexplained variability in this latent variable
model is $\operatorname{var}(u_j) + \operatorname{var}(v_{i(j)}) + \operatorname{var}(\epsilon_{i(j)t})$, where $\operatorname{var}(\epsilon_{i(j)t}) = \pi^2/3 = 3.29$ for the logistic
model and 1.0 for the probit model. Strong correlations between scores on different tests
for the students correspond to a relatively large value of $\operatorname{var}(v_{i(j)})$, and a relatively large
proportion of the total variability that is due to this variability among students.
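The link between variance components and correlations can be written out for the latent responses. The expressions below are a sketch for the logistic case, conditional on the explanatory variables, writing $\sigma_u^2 = \operatorname{var}(u_j)$ and $\sigma_v^2 = \operatorname{var}(v_{i(j)})$ (notation assumed here) and using the independence of the three error terms.

```latex
% Two different tests t and t' taken by the same student i in school j:
\operatorname{corr}\bigl(y^*_{i(j)t},\, y^*_{i(j)t'}\bigr)
  = \frac{\sigma_u^2 + \sigma_v^2}{\sigma_u^2 + \sigma_v^2 + \pi^2/3},

% Tests taken by two different students i and i' in the same school j:
\operatorname{corr}\bigl(y^*_{i(j)t},\, y^*_{i'(j)t'}\bigr)
  = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_v^2 + \pi^2/3}.
```

For the probit model, $\pi^2/3$ would be replaced by 1.0. These correlations refer to the latent scale, not to the observed binary responses.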

13.5.2  Example:  Children’s  Care  for  an  Unmarried  Mother 

What  factors  help  to  predict  whether  an  adult  child  provides  care  for  an  unmarried  elderly 
mother?  This  was  recently  investigated  by  sociologist  J.  Henretta  and  two  colleagues  using 
a  longitudinal  study  with  a  cohort  of  subjects  from  the  Health  and  Retirement  Study.  Their 
study used 16,719 observations on 5607 mother-child pairs in 1925 families. The outcome
measure  was  whether  a  child  provided  care  for  the  mother  by  providing  financial  help 
or  assistance  with  daily  living  tasks.  This  was  observed  by  interviews  of  the  mothers  in 
1998,  2000,  2002,  and  2004.  The  study  was  restricted  to  families  in  which  the  mother  was 
unmarried  at  her  first  interview,  since  spouses  rather  than  children  are  the  primary  helpers  of 
married  elders.  About  half  the  mothers  died  during  the  study  period,  and  proxy  respondents 
were  used  for  one  interview  after  this  happened.  Overall,  care  was  reported  to  be  provided 
in 18% of the observations.

Let $y_{i(j)t}$ denote the response for child $i$ in family $j$ about whether he/she provides care for
their elderly mother when observed at time $t$ (1 = yes, 0 = no). The explanatory variables
were  ethnicity,  the  year  of  the  observation,  characteristics  of  the  mother  (health,  age,  assets, 
whether  the  mother  was  experiencing  her  final  illness  before  death),  characteristics  of  the 
child  (sex,  whether  married,  whether  a  stepchild,  whether  has  children,  whether  attended 
college,  whether  the  mother  raised  a  child  of  his/hers  for  at  least  a  year,  whether  the 
child  received  from  the  mother  at  least  $5000  in  financial  help  in  the  10  years  preceding 
1993),  and  characteristics  of  the  family  (family  size,  %  of  children  who  are  male,  %  of 
children  who  are  married,  %  of  children  who  are  a  stepchild,  %  of  children  who  have  their 
own  children,  %  of  children  who  attended  college,  whether  the  mother’s  family  received 
financial  help  from  relatives  before  she  was  of  age  16).  All  these  variables  were  categorical 
in  measurement. 

Positive  correlation  occurs  among  repeated  observations  over  time  for  a  child  and  also 
among  observations  from  different  children  within  the  same  family.  Thus,  the  researchers 
posed multilevel models of form (13.20) with a random effect for each child at level 1 and
a random effect for each family at level 2. Table 13.13 shows ML estimates and SE values
for  their  full  model.  All  multicategory  explanatory  variables  were  treated  as  nominal,  using 
a  set  of  indicator  variables.  For  example,  the  estimates  shown  for  assets  used  the  sixth 
of  the  seven  categories  (namely,  $100,000-249,000)  as  the  baseline,  which  was  the  most 
common  category.  The  estimates  indicate  that  a  higher  probability  of  help  is  associated  with 
the  mother  having  poorer  health  and  more  advanced  age  and  in  her  final  illness,  with  the 
child  being  female  and  not  a  stepchild  and  without  children,  with  the  family  being  smaller 
and  with  relatively  more  children  who  are  male  and  who  themselves  have  children,  and 
when  the  mother  reported  that  her  family  received  help  when  she  was  growing  up. 

To  illustrate  interpretation  of  explanatory  effects,  consider  final  illness.  For  a  given  child 
and  fixed  values  of  other  explanatory  variables,  the  estimated  odds  of  providing  help  during 
the mother's final illness were exp(1.411) = 4.1 times the estimated odds when it was not
that  time.  Henretta  and  colleagues  also  provided  estimated  probabilities  of  providing  help. 
At  the  most  common  categories  for  each  of  the  other  explanatory  variables  and  at  random 
effect  values  of  0,  the  estimated  probabilities  were  0.17  when  it  was  the  time  of  the  final 
illness  and  0.05  when  it  was  not.  Such  probability  estimates  vary  considerably  according  to 
values of explanatory variables. For example, this pair (0.17, 0.05) of estimated probabilities
for care under (final illness, not final illness) changed for cases of only biological children
to (0.46, 0.17) for females and (0.34, 0.11) for males, and for biological children with two
female  sibs  to  (0.13,  0.03)  for  females  and  (0.05,  0.01)  for  males.  The  pairs  show  that  the 
effect  of  final  illness  is  very  strong.  These  last  four  pairs  also  provide  a  description  of  the 
sex  effect,  with  daughters  being  more  likely  to  provide  care,  and  show  that  care  is  much 
more  likely  by  an  only  child  than  by  a  child  with  two  female  sibs. 
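The odds-ratio interpretation of the final illness effect can be verified numerically. The sketch below uses the estimate and SE from Table 13.13; the 95% Wald interval is an addition made here for illustration and is not reported in the text. Because the quoted probabilities are rounded to two decimals, the odds ratio recovered from them only approximates exp(1.411).

```python
import math

beta_final_illness = 1.411   # estimate from Table 13.13
se_final_illness = 0.088     # its reported standard error

# Conditional odds ratio for a given child, other variables held fixed.
odds_ratio = math.exp(beta_final_illness)

# 95% Wald interval formed on the log odds scale, then exponentiated (illustrative).
ci = (
    math.exp(beta_final_illness - 1.96 * se_final_illness),
    math.exp(beta_final_illness + 1.96 * se_final_illness),
)

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Odds ratio implied by the rounded probabilities (0.17, 0.05) at random effects of 0.
or_from_probs = odds(0.17) / odds(0.05)

print(f"odds ratio = {odds_ratio:.2f}, 95% Wald CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"odds ratio from rounded probabilities = {or_from_probs:.2f}")
```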

For inference, from Table 13.13 it is possible only to construct Wald tests and confidence
intervals  comparing  each  category  of  a  factor  to  its  baseline.  However,  using  log-likelihood 
Table 13.13 ML Estimates and SE Values for Multilevel Model for Whether an
Adult Child Cares for Her Unmarried Elderly Mother

Effect                                      Estimate     SE
Intercept                                    -2.027     0.317
Ethnicity (vs. White)
  Black                                       0.162     0.157
  Hispanic                                   -0.165     0.207
  Other                                       0.459     0.498
Year (vs. 1998)
  2000                                       -0.152     0.084
  2002                                        0.019     0.092
  2004                                        0.072     0.106
Mother's characteristics
  Health (vs. Excellent)
    Very good                                -0.105     0.173
    Good                                      0.420     0.169
    Fair                                      0.701     0.173
    Poor                                      0.867     0.182
  Age (vs. 75-79)
    70-74                                    -0.552     0.177
    80-84                                     0.482     0.096
    85-89                                     0.928     0.123
    90+                                       1.213     0.156
  Assets (dollars) (vs. 100,000-249,000)
    Negative                                 -0.336     0.258
    0                                         0.004     0.151
    <25,000                                   0.070     0.118
    25,000-49,999                             0.234     0.128
    50,000-99,999                             0.171     0.111
    250,000+                                 -0.184     0.137
  Final illness                               1.411     0.088
Child characteristics
  Sex (Male = 1)                             -1.435     0.118
  Married (Yes = 1)                          -0.179     0.119
  Stepchild (Yes = 1)                        -3.574     0.503
  Children (Yes = 1)                         -0.414     0.154
  College (Yes = 1)                           0.183     0.142
  Parent raised child                         0.154     0.250
  Parent finan. help                         -0.205     0.184
Family characteristics
  Family size (vs. 1)
    2                                        -1.052     0.181
    3                                        -1.538     0.187
    4                                        -1.967     0.201
    5-6                                      -2.508     0.207
    7+                                       -2.521     0.224
  % Children
    Male                                      0.946     0.203
    Married                                  -0.051     0.202
    Stepchild                                 0.940     0.478
    Have children                             0.464     0.236
    Attended college                         -0.136     0.192
  Family got help (vs. No)
    Yes                                       0.595     0.187
    Missing                                   1.300     0.290

Source: Results taken from Table 2 in J. Henretta et al., J. Marriage & Family, 73: 383-395, 2011. Reprinted with
permission of J. Wiley & Sons.


values  L  reported  by  software  for  the  model  and  for  the  model  with  the  factor  removed, 
it’s  possible  to  do  the  usual  likelihood-ratio  tests.  For  example,  Henretta  and  colleagues 
reported $L = -6133.2$ for the model shown in Table 13.13 and $L = -6255.3$ for the
simpler  model  that  removes  all  the  family  characteristics.  Double  that  difference,  which 
equals  244.2,  is  a  chi-squared  statistic  with  df  =  12  for  testing  the  hypothesis  that  none  of 
the  family  characteristics  have  an  effect. 
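The likelihood-ratio statistic and its P-value can be reproduced from the two reported log-likelihood values. This is a minimal sketch using only the standard library; the helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form chi-squared tail probability that holds when df is even (as it is here, df = 12).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variate with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

loglik_full = -6133.2      # model of Table 13.13
loglik_reduced = -6255.3   # model without the family characteristics

lr_stat = 2.0 * (loglik_full - loglik_reduced)
p_value = chi2_sf_even_df(lr_stat, df=12)

print(f"LR statistic = {lr_stat:.1f}, df = 12, P-value = {p_value:.2e}")
```

The P-value is essentially zero, so the family characteristics jointly contribute strongly to the fit.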

The estimated variance components were 4.38 (SE = 0.32) for the $v_{i(j)}$ child random
effects and 1.20 (SE = 0.18) for the $u_j$ family random effects. As we'd expect, these reflect
an especially strong within-child correlation in the repeated responses. Since $4.38/(4.38 +
1.20 + \pi^2/3) = 0.49$, variability among the children accounts for 49% of the total residual
variance. Since $1.20/(4.38 + 1.20 + \pi^2/3) = 0.14$, family membership accounts for 14%
of the total residual variance. This measure of variability among families also describes
the degree to which siblings in a family are similar to each other, net of the explanatory
variables in the model. The variability among families was more substantial in simpler
models  that  did  not  include  explanatory  variables  pertaining  to  the  families  such  as  mother’s 
characteristics. 
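The variance shares quoted above are easy to reproduce. The sketch below also reports the share left to the occasion-level logistic error and the latent-scale correlation between two occasions for the same child; that last quantity is an illustrative addition, computed under the latent variable formulation of Section 13.5.1, and is not a figure reported by the study.

```python
import math

var_child = 4.38                   # estimated variance of child random effects
var_family = 1.20                  # estimated variance of family random effects
var_logistic = math.pi ** 2 / 3    # variance of the standard logistic error, about 3.29

total = var_child + var_family + var_logistic
share_child = var_child / total
share_family = var_family / total
share_occasion = var_logistic / total

# Latent-scale correlation between two occasions for the same child:
# the family and child effects are shared, the logistic errors are not.
within_child_corr = (var_child + var_family) / total

print(f"child share    = {share_child:.2f}")
print(f"family share   = {share_family:.2f}")
print(f"occasion share = {share_occasion:.2f}")
print(f"within-child latent correlation = {within_child_corr:.2f}")
```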

More  complex  models  with  interaction  terms  did  not  fit  significantly  better.  In  the 
other  direction,  the  model  shown  in  Table  13.13  could  be  simplified  by  removing  some 
nonsignificant  factors  or  by  treating  the  ordinal  factors  health,  age,  and  family  size  in  a 
quantitative  manner  and  describing  such  effects  by  trends. 


13.6  GLMM  FITTING,  INFERENCE,  AND  PREDICTION 

Model  fitting  is  rather  complex  for  GLMMs,  because  the  likelihood  function  does  not  have 
a  closed  form.  Numerical  methods  for  approximating  it  can  be  computationally  intensive 
for  models  with  multivariate  random  effects.  In  this  section  we  outline  the  basic  ideas  of 
ML fitting. See Fahrmeir and Tutz (2001, Chap. 7) and McCulloch et al. (2008) for more
details. 

13.6.1  Marginal  Likelihood  and  Maximum  Likelihood  Fitting 

The GLMM is a two-stage model. At the first stage, conditional on the random effects $\{u_i\}$,
observations are assumed to follow a GLM. That is, all observations are independent, with
$y_{it}$ in cluster $i$ having distribution in the exponential family with expected value $\mu_{it}$ linked
to a linear predictor,

$$g(\mu_{it}) = x_{it}^T \beta + z_{it}^T u_i.$$

Then, $z_{it}^T u_i$ is a known offset. At the second stage, $\{u_i\}$ are assumed independent from a
$N(0, \Sigma)$ distribution.

For a discrete variable, denote the vector of observations by $y$ and the vector of random
effects by $u$. Let $f(y \mid u; \beta)$ denote the conditional mass function of $y$, given $u$. Let $f(u; \Sigma)$
denote the normal probability density function for $u$. The likelihood function $\ell(\beta, \Sigma; y)$
for a GLMM is the probability mass function $f(y; \beta, \Sigma)$ of $y$, viewed as a function of $\beta$
and $\Sigma$. This mass function refers to the marginal distribution of $y$ after integrating out the
random effects,

$$\ell(\beta, \Sigma; y) = f(y; \beta, \Sigma) = \int f(y \mid u; \beta)\, f(u; \Sigma)\, du. \tag{13.21}$$

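It is often called a marginal likelihood. For example, the marginal likelihood function
$\ell(\beta, \sigma^2; y)$ for the logistic-normal random intercept model (13.6) (absorbing $\alpha$ into $\beta$) is

$$\ell(\beta, \sigma^2; y) = \prod_{i=1}^{n} \left\{ \int_{-\infty}^{\infty} \prod_{t=1}^{T} \left[ \frac{\exp(x_{it}^T \beta + u_i)}{1 + \exp(x_{it}^T \beta + u_i)} \right]^{y_{it}} \left[ \frac{1}{1 + \exp(x_{it}^T \beta + u_i)} \right]^{1 - y_{it}} f(u_i; \sigma^2)\, du_i \right\}.$$

Many methods can evaluate this numerically and maximize it as a function of $\beta$ and $\Sigma$.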
It  is  an  active  area  of  research  to  develop  improved  methods.  We  next  discuss  a  few  of  the 
most  popular  methods. 
13.6.2  Gauss-Hermite  Quadrature  Methods  for  ML  Fitting

The  integral  determining  the  likelihood  function  has  dimension  that  depends  on  the  random 
effects  structure.  When  the  dimension  is  small,  as  in  the  one-dimensional  integral  above, 
standard  numerical  integration  methods  can  approximate  the  likelihood  function. 

Gauss-Hermite quadrature is a method for approximating the integral of a function $f(\cdot)$
multiplied by a normal density function. The approximation is a finite weighted sum that
evaluates the function at certain points. In the univariate normal random effects case, the
approximation has the form

$$\int_{-\infty}^{\infty} f(u) \exp(-u^2)\, du \approx \sum_{k=1}^{q} c_k f(s_k),$$

with weights $\{c_k\}$ and quadrature points $\{s_k\}$ that are tabulated. The approximation improves
as $q$, the number of quadrature points, increases.
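The quadrature rule can be tried directly with tabulated points and weights. The sketch below assumes NumPy is available and uses its `hermgauss` routine, which returns points and weights for the weight function exp(-u²). The function names `gauss_hermite_integral` and `normal_expectation` are hypothetical. The second function applies the change of variable u = √2·σ·s, so that an expectation under a N(0, σ²) density takes the form above.

```python
import math

import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_integral(f, q):
    """Approximate the integral of f(u) * exp(-u**2) over the real line with q points."""
    points, weights = hermgauss(q)
    return float(np.sum(weights * f(points)))

def normal_expectation(g, sigma, q):
    """Approximate E[g(U)] for U ~ N(0, sigma**2) using the substitution u = sqrt(2)*sigma*s."""
    points, weights = hermgauss(q)
    values = g(math.sqrt(2.0) * sigma * points)
    return float(np.sum(weights * values) / math.sqrt(math.pi))

# The rule is exact for low-degree polynomials: integral of u^2 exp(-u^2) is sqrt(pi)/2.
second_moment = gauss_hermite_integral(lambda u: u ** 2, q=5)

# Marginal success probability when logit(p) = 0.5 + u and u ~ N(0, 1.5^2),
# evaluated with an increasing number of quadrature points.
def expit_shifted(u):
    return 1.0 / (1.0 + np.exp(-(0.5 + u)))

expit_mean = [normal_expectation(expit_shifted, 1.5, q) for q in (3, 10, 40)]

print(second_moment, math.sqrt(math.pi) / 2)
print(expit_mean)
```

The marginal probability is pulled toward 0.5 relative to the conditional value at u = 0, which is the usual attenuation of marginal effects relative to cluster-specific effects.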

The approximated likelihood can be maximized with standard algorithms such as
Newton-Raphson, yielding ML estimates $\hat{\beta}$ and $\hat{\Sigma}$. Inverting an approximation for the
observed information matrix provides standard errors for the ML estimates. For complex
models, second partial derivatives for the Hessian may be computed numerically rather
than analytically. Adequate approximation usually requires larger $q$ for standard errors than
for $\hat{\beta}$. We recommend sequentially increasing $q$ until the changes are negligible in both the
estimates and standard errors.
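The advice to increase q until results stabilize can be illustrated on the likelihood itself. The following sketch evaluates the marginal log-likelihood of a logistic-normal random intercept model at fixed parameter values for several q. The toy clusters, the linear predictor values, and σ = 1.5 are all hypothetical, and the function name `cluster_loglik` is introduced here; a real fit would also maximize over the parameters and monitor the estimates and standard errors.

```python
import math

import numpy as np
from numpy.polynomial.hermite import hermgauss

def cluster_loglik(y, eta_fixed, sigma, q):
    """Log marginal likelihood of one cluster of binary responses.

    y          : 0/1 responses for the cluster
    eta_fixed  : fixed-effect part of the linear predictor for each response
    sigma      : standard deviation of the normal random intercept
    q          : number of Gauss-Hermite quadrature points
    """
    s, c = hermgauss(q)
    u = math.sqrt(2.0) * sigma * s                    # quadrature points on the u scale
    eta = eta_fixed[:, None] + u[None, :]             # shape (T, q)
    p = 1.0 / (1.0 + np.exp(-eta))
    cond_lik = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)
    return math.log(np.sum(c * cond_lik) / math.sqrt(math.pi))

# Hypothetical toy data: three clusters with four binary responses each.
eta_fixed = np.array([-0.5, 0.0, 0.5, 1.0])
clusters = [
    np.array([1, 1, 0, 1]),
    np.array([0, 0, 0, 1]),
    np.array([1, 1, 1, 1]),
]
sigma = 1.5

loglik_by_q = {
    q: sum(cluster_loglik(y, eta_fixed, sigma, q) for y in clusters)
    for q in (2, 5, 10, 20, 40)
}
for q, value in loglik_by_q.items():
    print(q, round(value, 6))
```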

When the function $f$ to be integrated is not centered at 0, many of the quadrature points
may  fall  outside  the  main  region  of  integration.  An  adaptive  version  of  Gauss-Hermite 
quadrature  (Liu  and  Pierce  1994,  Rabe-Hesketh  et  al.  2005)  centers  the  quadrature  points 
with  respect  to  the  mode  of  the  function  being  integrated  and  scales  them  according  to 
the  estimated  curvature  at  the  mode.  This  improves  efficiency,  dramatically  reducing  the 
number  of  quadrature  points  needed  to  approximate  the  integrals  effectively.  Lesaffre 
and  Spiessens  (2001)  showed  comparisons  and  warned  against  using  too  few  quadrature 
points. 

13.6.3  Monte  Carlo  and  EM  Methods  for  ML  Fitting 

Multivariate  forms  of  Gauss-Hermite  quadrature  handle  multivariate,  correlated  random 
effects.  Adequate  approximation  becomes  more  difficult,  however,  as  the  dimension  of  the 
integral  increases  much  beyond  the  bivariate  case.  Then,  Monte  Carlo  methods  are  more 
feasible  computationally  than  numerical  integration.  Various  Monte  Carlo  approaches  are 
available [e.g., McCulloch et al. (2008, Chap. 14)], including Monte Carlo in combination
with  Newton-Raphson,  Monte  Carlo  in  combination  with  the  EM  algorithm,  and  simulation 
to  estimate  the  likelihood  directly.  Here,  we  briefly  describe  a  Monte  Carlo  EM  (MCEM) 
algorithm. 
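Of the approaches just listed, simulation to estimate the likelihood directly is the simplest to sketch. The code below is only a crude illustration, not the MCEM algorithm: it averages the conditional likelihood f(y | u) over draws of u from its normal distribution for a single cluster and compares the result with a fine-grid trapezoidal integral. The data, σ = 1.5, the seed, and all function names are hypothetical.

```python
import math
import random

def conditional_lik(y, eta_fixed, u):
    """f(y | u) for binary responses with a logit link and random intercept u."""
    lik = 1.0
    for y_t, eta_t in zip(y, eta_fixed):
        p = 1.0 / (1.0 + math.exp(-(eta_t + u)))
        lik *= p if y_t == 1 else 1.0 - p
    return lik

def monte_carlo_marginal_lik(y, eta_fixed, sigma, n_draws, rng):
    """Average f(y | u) over independent draws u ~ N(0, sigma**2)."""
    total = 0.0
    for _ in range(n_draws):
        total += conditional_lik(y, eta_fixed, rng.gauss(0.0, sigma))
    return total / n_draws

def grid_marginal_lik(y, eta_fixed, sigma, n_steps=4000):
    """Reference value from the trapezoidal rule on [-10*sigma, 10*sigma]."""
    lo, hi = -10.0 * sigma, 10.0 * sigma
    h = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        u = lo + k * h
        density = math.exp(-u * u / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * conditional_lik(y, eta_fixed, u) * density
    return total * h

y = [1, 1, 0, 1]
eta_fixed = [-0.5, 0.0, 0.5, 1.0]
sigma = 1.5

mc_value = monte_carlo_marginal_lik(y, eta_fixed, sigma, 100_000, random.Random(2024))
grid_value = grid_marginal_lik(y, eta_fixed, sigma)
print(mc_value, grid_value)
```

With multivariate random effects the grid is no longer practical, while the Monte Carlo average extends naturally, at the cost of simulation error that shrinks only at rate 1/√(number of draws).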

The  EM  algorithm  is  a  popular  iterative  method  of  finding  ML  estimates  when  data  are 
missing  or  when  filling  in  some  “missing”  data  simplifies  a  likelihood  function.  Laird  (2005) 
and Fan et al. (2010) gave useful reviews. In each cycle, an E-step takes an expectation
over the missing data at working values of the parameters to approximate the likelihood
function, and an M-step maximizes that function to generate new working values of the
parameter estimates. With GLMMs, we regard the random effects $u$ as missing data. Then,
$h(y, u; \beta, \Sigma) = f(y \mid u; \beta) f(u; \Sigma)$ specifies the joint distribution of the complete data. The
E-step in iteration $r$ of the EM algorithm calculates, using Monte Carlo methods,

$$E[\log h(y, u; \beta, \Sigma) \mid y; \beta^{(r)}, \Sigma^{(r)}].$$

The expectation refers to the distribution of $(u \mid y)$ with parameter values equal to $\beta^{(r)}$ and
$\Sigma^{(r)}$, the working estimates for iteration $r$. The distribution of $(u \mid y)$ follows from those of
$(y \mid u)$ and $u$ in the GLMM via Bayes' theorem. The M-step then maximizes (using MCMC
or other Monte Carlo methods) the resulting function of $(y, \beta, \Sigma)$ with respect to $\beta$ and $\Sigma$
to obtain $\beta^{(r+1)}$ and $\Sigma^{(r+1)}$. For details, including ways of choosing an appropriate Monte
Carlo sample size, see Booth and Hobert (1999). Unfortunately, this method can also be
impractical for large problems.

13.6.4  Laplace  and  Penalized  Quasi-likelihood  Approximations  to  ML 

The Gauss-Hermite and Monte Carlo integration methods provide likelihood approximations
such that resulting parameter estimates converge to the ML estimates as they
are  applied  more  finely,  that  is,  as  the  number  of  quadrature  points  increases  for 
Gauss-Hermite  integration  and  as  the  Monte  Carlo  sample  size  increases  in  the  MCEM 
method.  This  contrasts  with  other  approximate  methods  that  are  simpler  but  do  not  yield 
ML  estimates.  These  methods  maximize  an  analytical  approximation  of  the  likelihood 
function. 

Recall that the likelihood function (13.21) results from integrating out the random
effects $u$ from the joint distribution of $y$ and $u$. Using the exponential family representation
of each component of that joint distribution, the integrand of (13.21) is an exponential
function of $u$. One approach approximates that function using a second-order Taylor series
expansion of its exponent around a point $\tilde{u}$ at which the first-order term equals 0. [That
point $\tilde{u} \approx E(u \mid y)$.] The approximating function for the integrand is then exponential with
quadratic exponent in $(u - \tilde{u})$ and has the form of a constant multiple of a multivariate
normal density. Thus, its integral has closed form. This type of integral approximation is
called a Laplace approximation. The approximation for integral (13.21) is then treated as
a likelihood and maximized with respect to $\beta$ and $\Sigma$.
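
The Laplace idea can be sketched for one cluster of a random intercept logistic model. This is an illustrative sketch under stated assumptions, not the algorithm of any particular software: the data (7 successes in 10 trials), the parameter values, and the function names are hypothetical, and the binomial coefficient is omitted from both the approximation and the reference value.

```python
import math

def log_integrand(u, successes, trials, alpha, sigma):
    """Log of f(y | u) f(u) for one cluster, dropping the binomial coefficient."""
    eta = alpha + u
    log_lik = successes * eta - trials * math.log1p(math.exp(eta))
    log_density = -0.5 * (u / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    return log_lik + log_density

def laplace_integral(successes, trials, alpha, sigma, n_iter=50):
    """Return (Laplace approximation to the integral over u, expansion point)."""
    u = 0.0
    for _ in range(n_iter):
        # Newton steps locate the point where the first-order term vanishes.
        p = 1.0 / (1.0 + math.exp(-(alpha + u)))
        first = successes - trials * p - u / sigma ** 2
        second = -trials * p * (1.0 - p) - 1.0 / sigma ** 2
        u -= first / second
    p = 1.0 / (1.0 + math.exp(-(alpha + u)))
    second = -trials * p * (1.0 - p) - 1.0 / sigma ** 2
    # Integral of exp(g(u_tilde) + g''(u_tilde) (u - u_tilde)^2 / 2) in closed form.
    value = math.exp(log_integrand(u, successes, trials, alpha, sigma))
    value *= math.sqrt(2.0 * math.pi / -second)
    return value, u

def grid_integral(successes, trials, alpha, sigma, half_width=8.0, n_steps=4000):
    """Reference value by the trapezoid rule on a fine grid."""
    lo = -half_width * sigma
    step = 2.0 * half_width * sigma / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * math.exp(log_integrand(lo + k * step, successes, trials, alpha, sigma))
    return total * step

approx, u_tilde = laplace_integral(7, 10, alpha=0.0, sigma=1.0)
exact = grid_integral(7, 10, alpha=0.0, sigma=1.0)
print(u_tilde, approx, exact, abs(approx - exact) / exact)
```

The approximation needs only the expansion point and the curvature there, so no numerical or Monte Carlo integration is involved; the price is an approximation error that does not vanish with more computation.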

For one such method (Breslow and Clayton 1993), the integral approximation yields a
function approximating the log likelihood that has the form

$$q(\beta, y) - (1/2)\tilde{u}^T \Sigma^{-1} \tilde{u},$$

where $q(\beta, y)$ resembles a quasi-log-likelihood function for the GLM conditional on $u = \tilde{u}$.
Thus,  the  approximation  results  in  a  penalty  for  the  quasi-log  likelihood,  with  the  penalty 
increasing  as  elements  of  u  increase  in  absolute  value.  This  approach  is  called  penalized 
quasi-likelihood  (PQL). 

The  calculations  for  maximizing  the  penalized  quasi-likelihood  use  methods  for  linear 
mixed  models  with  a  normal  response.  This  treats  a  linearization  of  the  logit  as  a  working 
response and entails iterative solution of sets of likelihood-like equations in $\beta$ and $u$. PQL
methods  do  not  require  numerical  or  Monte  Carlo  integration  and  so  are  computationally 
simpler  than  ML  methods.  Unfortunately,  they  can  perform  poorly  relative  to  ML  (Breslow 
and  Lin  1995).  When  true  variance  components  are  large,  ordinarily  PQL  tends  to  produce 
variance  component  estimates  with  substantial  negative  bias.  The  PQL  estimators  also 
behave poorly when the response distribution is far from normal (e.g., binary).

Refinements have been and are being developed to lessen the bias of methods based
on  the  Laplace  approximation.  These  are  useful,  because  precise  ML  inference  using 
Gauss-Hermite  quadrature  or  Monte  Carlo  methods  is  still  impractical  for  large  problems 
with  GLMMs.  See  Zipunnikov  and  Booth  (2012)  for  one  such  method  and  for  relevant 
references. 

13.6.5  Inference  for  GLMM  Parameters 

After fitting the model, inference about fixed effects proceeds in the usual way. For instance,
likelihood-ratio tests can compare nested models. Asymptotics for GLMMs apply
as  the  number  of  clusters  increases,  rather  than  as  the  numbers  of  observations  within  the 
clusters  increase.  Similarly,  resampling  methods  such  as  the  bootstrap  using  a  large  number 
of  clusters  should  sample  clusters  rather  than  individual  observations  within  clusters,  to 
preserve  the  within-cluster  dependence. 
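
Resampling whole clusters can be sketched as follows. The data layout (a dictionary mapping a cluster label to its list of binary responses), the toy data, and the function names are assumptions made for illustration.

```python
import random

def cluster_bootstrap(data_by_cluster, n_replicates, rng):
    """Draw bootstrap replicates by sampling whole clusters with replacement."""
    cluster_ids = list(data_by_cluster)
    replicates = []
    for _ in range(n_replicates):
        sampled_ids = [rng.choice(cluster_ids) for _ in cluster_ids]
        # Each sampled cluster keeps all of its observations together,
        # so the within-cluster dependence is carried into the replicate.
        replicates.append([list(data_by_cluster[c]) for c in sampled_ids])
    return replicates

def overall_proportion(replicate):
    """Proportion of successes pooled over the clusters in one replicate."""
    successes = sum(sum(cluster) for cluster in replicate)
    trials = sum(len(cluster) for cluster in replicate)
    return successes / trials

# Hypothetical toy data: binary responses grouped by cluster.
toy_data = {"a": [1, 0, 1], "b": [0, 0], "c": [1, 1, 1, 0], "d": [0, 1]}
replicates = cluster_bootstrap(toy_data, n_replicates=500, rng=random.Random(42))
estimates = [overall_proportion(rep) for rep in replicates]
mean_estimate = sum(estimates) / len(estimates)
bootstrap_se = (sum((e - mean_estimate) ** 2 for e in estimates) / (len(estimates) - 1)) ** 0.5
print(bootstrap_se)
```

With only four clusters the toy example is far too small for the asymptotics to apply; it shows only the resampling mechanics.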

Inference  about  random  effects  (e.g.,  their  variance  components)  is  more  complex.  For 
instance,  sometimes  one  model  is  a  special  case  of  another  in  which  a  variance  component 
equals  0.  The  simpler  model  then  falls  on  the  boundary  of  the  parameter  space  relative  to 
the  more  complex  model,  so  ordinary  likelihood-based  inference  does  not  apply.  For  the 
most common situation, testing $H_0: \sigma^2 = 0$ against $H_a: \sigma^2 > 0$ for a model containing
a random intercept, the null asymptotic distribution of the likelihood-ratio statistic is an
equal mixture of $\chi^2_0$ (i.e., degenerate at 0) and $\chi^2_1$ random variables (Self and Liang 1987).
The value of 0 occurs when $\hat{\sigma} = 0$, in which case the maximized likelihoods are identical
under $H_0$ and $H_a$. When $\hat{\sigma} > 0$ and the observed test statistic equals $t$, the P-value for
this large-sample test is $\frac{1}{2}P(\chi^2_1 > t)$, half the P-value that applies for $\chi^2_1$ asymptotic tests.
For  testing  more  than  one  variance  component,  the  mixture  distribution  is  more  complex 
(Molenberghs  and  Verbeke  2007). 
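
For the single random intercept case, the mixture P-value is simple to compute. The sketch below uses the identity $P(\chi^2_1 > t) = P(|Z| > \sqrt{t})$ for a standard normal $Z$; the function names are introduced here for illustration.

```python
import math

def chi2_1_upper_tail(t):
    """P(chi-squared with 1 df > t), computed as P(|Z| > sqrt(t))."""
    return math.erfc(math.sqrt(t / 2.0))

def boundary_lr_pvalue(t):
    """P-value for testing a zero variance component with the equal mixture null."""
    if t <= 0.0:
        # The statistic equals 0 when the variance estimate is 0.
        return 1.0
    return 0.5 * chi2_1_upper_tail(t)

# A statistic that would give P = 0.10 with the ordinary chi-squared(1)
# reference distribution gives about 0.05 here.
print(chi2_1_upper_tail(2.7055), boundary_lr_pvalue(2.7055))
```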

13.6.6  Prediction  Using  Random  Effects 

We’ve  used  random  effects  in  models  to  represent  heterogeneity  of  certain  characteristics, 
such  as  probabilities  or  odds  ratios.  Estimated  effects  of  interest  are  often  then  linear 
combinations  of  fixed  and  random  effects.  For  example,  in  the  clinical  trial  comparing  two 
treatments with random effects for centers (Section 13.3.5), we can predict the probability
of  success  for  each  treatment  in  each  center  and  odds  ratios  in  those  centers. 

Given the data, the conditional distribution of $(u \mid y)$ contains the information about the
random effects $u$. A prediction for $u$ is $E(u \mid y)$, its posterior mean given the data. Calculation
of $E(u \mid y)$ itself requires numerical integration or Monte Carlo approximation. The
expectation depends on $\beta$ and $\Sigma$, so in practice we substitute $\hat{\beta}$ and $\hat{\Sigma}$ in the approximation.
The standard error of the predictor of the random effect $u_i$ is the standard deviation of the
distribution of $(u_i \mid y)$. When we substitute $\hat{\beta}$ and $\hat{\Sigma}$ in $E(u \mid y)$, however, the standard error
does  not  account  for  the  sampling  variability  in  those  estimates.  Hence,  the  true  standard 
error  tends  to  be  underestimated  (Booth  and  Hobert  1998). 
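
As an illustration of computing $E(u \mid y)$ by numerical integration, the sketch below handles one cluster in a random intercept logistic model. The values of `alpha` and `sigma` stand in for plug-in estimates and are hypothetical, as are the data and the function names.

```python
import math

def posterior_mean_random_intercept(successes, trials, alpha, sigma,
                                    half_width=8.0, n_steps=4000):
    """E(u | y) for one cluster, by trapezoid-rule integration over u."""
    lo = -half_width * sigma
    step = 2.0 * half_width * sigma / n_steps
    numerator = 0.0
    denominator = 0.0
    for k in range(n_steps + 1):
        u = lo + k * step
        eta = alpha + u
        # Unnormalized posterior density of u: f(y | u) times the N(0, sigma^2) density.
        log_weight = successes * eta - trials * math.log1p(math.exp(eta)) - 0.5 * (u / sigma) ** 2
        weight = math.exp(log_weight) * (0.5 if k in (0, n_steps) else 1.0)
        numerator += u * weight
        denominator += weight
    return numerator / denominator

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# A cluster with 0 successes in 5 trials: the sample proportion is 0,
# but the prediction is pulled toward the overall level expit(alpha).
u_hat = posterior_mean_random_intercept(0, 5, alpha=0.0, sigma=1.0)
print(u_hat, expit(0.0 + u_hat))
```

The last line plugs the predicted random effect into the linear predictor, which is one simple way to form a cluster-specific predicted probability; it is not the same as the posterior mean of the probability itself.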

This  approach  to  prediction  using  posterior  means  of  random  effects  provides  effect 
estimates  that  exhibit  shrinkage  relative  to  estimates  using  only  data  in  the  specific  cluster. 
In  this  sense  the  results  are  similar  to  those  using  an  empirical  Bayes  approach  (Section 
3.6.7,  Efron  and  Morris  1975,  Ten  Have  and  Localio  1999).  Shrinkage  estimators  can  be  far 
superior  to  sample  values  when  the  sample  size  for  estimating  each  parameter  is  small,  when 
there are many parameters to estimate, or when the true parameter values are roughly equal.

13.7  BAYESIAN  MULTIVARIATE  CATEGORICAL  MODELING

With  the  Bayesian  approach  to  GLMMs,  the  distinction  between  fixed  and  random  effects 
need  no  longer  occur,  as  every  effect  has  a  probability  distribution.  However,  there  is  still 
the  distinction  between  cluster-specific  versus  population-averaged  effects  according  to 
whether  the  linear  predictor  contains  a  term  for  each  cluster. 

13.7.1  Marginal  Homogeneity  Analyses  for  Matched  Pairs 

For  matched-pairs  binary  data  without  explanatory  variables,  Altham  (1971)  proposed 
Bayesian  analyses.  In  the  simplest  case,  she  considered  a  model  in  which  the  probability  of 
success is assumed the same for each subject at a given occasion. She showed that the classical
exact P-value for testing the null hypothesis of marginal homogeneity (Section 11.1.5,
using the binomial distribution) is also a Bayesian posterior probability for a Dirichlet prior
distribution favoring $H_0$.

Altham also used a model similar to (11.8) in which the probability varies by subject
but the occasion effect is constant. She showed that the Bayesian evidence against the null
hypothesis is weaker as the number of pairs giving the same response at both occasions
increases, for fixed values of the numbers of pairs giving different responses at the two
occasions. This differs from the frequentist result, using the McNemar test or conditional
ML or random effects marginal ML, which does not depend on such pairs (e.g., Section
11.2.3). Consonni and La Rocca (2008) and Ghosh et al. (2000) showed related results.
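
The frequentist side of this contrast can be sketched with an exact binomial test that takes only the discordant counts as input. The two-sided form used here and the names `n12` and `n21` (pairs with different responses at the two occasions) are assumptions made for illustration.

```python
import math

def exact_mcnemar_pvalue(n12, n21):
    """Two-sided exact P-value: under marginal homogeneity, n12 ~ bin(n12 + n21, 1/2)."""
    n = n12 + n21
    if n == 0:
        return 1.0
    k = min(n12, n21)
    lower_tail = sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n
    # Doubling the smaller tail is one common two-sided convention.
    return min(1.0, 2.0 * lower_tail)

# The pairs giving the same response at both occasions never enter the calculation.
print(exact_mcnemar_pvalue(3, 12))
```

Because the function has no argument for the concordant pairs, its result is the same whether there are ten such pairs or ten thousand, which is the property that Altham's Bayesian analysis does not share.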

13.7.2  Bayesian  Approaches  to  Meta-analysis  and  Multicenter  Trials 

In  Sections  13.3.5  and  13.3.6  we  used  random  effects  models  to  summarize  heterogeneity 
in  multicenter  clinical  trials  and  in  meta-analyses.  Comparable  Bayesian  analyses  can  use 
similar  distributions  for  the  random  effects  but  also  for  the  fixed  effects. 

Skene and Wakefield (1990) modeled multicenter binary-response studies with a logistic
model that allows the treatment log odds ratio effect to vary among centers. They
parameterized  the  model  in  terms  of  the  logit  for  one  group  (identified  as  a  placebo)  and 
the  log  odds  ratio  comparing  it  to  the  second  group.  Conditional  on  those  parameters,  they 
assumed  independent  binomial  distributions  for  the  two  groups  in  each  center.  For  the  logit 
and  log  odds  ratio,  they  assumed  exchangeability  among  centers,  with  a  bivariate  normal 
prior  having  the  five  hyperparameters  unspecified  but  with  their  own  second-stage  prior. 
They  treated  the  normal  mean  vector  and  covariance  matrix  as  independent,  with  improper 
uniform  priors  for  the  means.  They  used  various  inverse  Wishart  prior  distributions  for 
the covariance, and conducted a sensitivity study of the extent to which posterior
inferences  depended  on  that  choice.  The  marginal  posterior  distributions  of  the  mean  and 
variance  of  the  log  odds  ratio  component  then  describe  the  difference  between  treatments 
and  its  heterogeneity  among  centers. 
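
One way to write down a hierarchical structure of this kind is sketched below. The symbols are assumptions introduced here, not Skene and Wakefield's own notation: $\theta_i$ is the placebo logit in center $i$, $\phi_i$ is the log odds ratio (taken here as second group relative to placebo), and $y_{i1}, y_{i2}$ are the numbers of successes out of $n_{i1}, n_{i2}$ in the two groups.

```latex
y_{i1} \mid \theta_i \sim \mathrm{bin}(n_{i1}, \pi_{i1}), \qquad \mathrm{logit}(\pi_{i1}) = \theta_i, \\
y_{i2} \mid \theta_i, \phi_i \sim \mathrm{bin}(n_{i2}, \pi_{i2}), \qquad \mathrm{logit}(\pi_{i2}) = \theta_i + \phi_i, \\
(\theta_i, \phi_i)^T \sim N(\mu, \Sigma), \qquad p(\mu) \propto 1, \qquad \Sigma \sim \text{inverse Wishart}.
```

The five hyperparameters are the two elements of $\mu$ and the three distinct elements of $\Sigma$.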

Skene  and  Wakefield  also  suggested  forming  a  predictive  distribution  for  the  log  odds 
ratio  for  a  new  center  that  was  not  part  of  the  study  but  could  be  considered  exchangeable 
with  the  ones  in  the  study.  This  provides  a  way  of  assessing  how  the  treatment  would 
perform  compared  with  placebo  in  a  new  center. 

Warn  et  al.  (2002)  proposed  Bayesian  approaches  to  the  meta-analysis  issue  discussed 
in  Section  13.3.6  of  allowing  variability  in  the  difference  of  proportions  and  relative  risk. 
This addresses the difficulty of using normal distributions for parameters that have bounds
on their values. Consider the model for binomial parameters whereby for study $i$,

$$P(Y_{i1} = 1) = \pi_{i1}, \quad P(Y_{i2} = 1) = \pi_{i1} + \delta_i,$$

where $\delta_i = \pi_{i2} - \pi_{i1}$, assuming that $\{\delta_i\}$ have a $N(\delta, \tau^2)$ distribution. To reflect that $\pi_{i1}$
and $\pi_{i2}$ are constrained to [0, 1], Warn et al. (2002) expressed $P(Y_{i2} = 1)$ as

$$P(Y_{i2} = 1) = \pi_{i1} + \min[\max(\delta_i, -\pi_{i1}), 1 - \pi_{i1}],$$

the increment to $\pi_{i1}$ ensuring that $\pi_{i2}$ falls in [0, 1]. For other prior distributions, Warn et al.
(2002) suggested a uniform distribution over $(-1, 1)$ for $\delta$ and a uniform distribution over
(0, 2) for $\tau$, which contains all the plausible values for $\tau$. Possible priors for $\{\pi_{i1}\}$ included
a beta distribution with uniform (1, 100) hyperpriors on the beta parameters, which are
unimodal but relatively uninformative. This approach also has some awkward aspects. For
instance, whenever $\delta_i$ is sampled outside the range $[-\pi_{i1}, 1 - \pi_{i1}]$, $\pi_{i2}$ is set to 0 or 1,
resulting in spikes of probability at these values.
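
The truncation and the resulting spikes can be seen in a short simulation. The values of `pi1`, `delta`, and `tau` below are hypothetical, chosen so that truncation happens often; the function name is introduced here for illustration.

```python
import random

def truncated_second_probability(pi1, delta_i):
    """pi1 plus the increment min[max(delta_i, -pi1), 1 - pi1]."""
    increment = min(max(delta_i, -pi1), 1.0 - pi1)
    return pi1 + increment

rng = random.Random(0)
pi1, delta, tau = 0.1, 0.0, 0.3  # hypothetical values for illustration
draws = [truncated_second_probability(pi1, rng.gauss(delta, tau)) for _ in range(10000)]

# Shares of draws piled up exactly at the two boundaries.
share_at_zero = sum(p == 0.0 for p in draws) / len(draws)
share_at_one = sum(p == 1.0 for p in draws) / len(draws)
print(share_at_zero, share_at_one)
```

With a baseline probability this close to 0, a sizable share of the simulated values lands exactly on 0, which is the awkward point mass described above.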

Casella  and  Moreno  (2005)  and  Efron  (1996)  also  proposed  Bayesian  methods  for 
summarizing information from several 2×2 tables. Efron used empirical Bayesian methods
to  summarize  odds  ratios  from  41  different  trials  of  a  surgical  treatment  for  ulcers.  His 
method  permits  selection  from  a  wide  class  of  priors  in  the  exponential  family. 


13.7.3  Example:  Bayesian  Analyses  for  a  Multicenter  Trial 

Skene  and  Wakefield  illustrated  their  methodology  with  a  Bayesian  analysis  of  Table  13.7, 
from  the  study  to  compare  placebo  with  a  drug  for  curing  a  fungal  infection  that  we  analyzed 
with  GLMMs  in  Section  13.3.5.  They  noted  that  the  data  show  evidence  of  a  decrease  in 
the  treatment  effect  when  the  placebo  success  rate  increases.  Varying  the  Wishart  second 
stage  prior  for  the  covariance  of  the  normal  prior  for  the  placebo  logit  and  the  log  odds 
ratio  had  an  effect  on  the  posterior  distribution  of  the  variance  of  the  log  odds  ratio  but  little 
effect  on  the  posterior  distribution  of  its  mean. 

The  posterior  distribution  for  the  mean  of  the  log  odds  ratios  had  mean  falling  between 
0.82  and  0.99  for  the  various  priors,  and  standard  deviation  falling  between  0.42  and  0.52. 
By contrast, the GLMM analysis with treatment × center interaction estimated that the
log  odds  ratios  have  a  mean  of  0.75,  with  standard  error  0.32.  The  Bayesian  analyses  also 
reported  posterior  probabilities  of  a  negative  mean  treatment  effect  (typically  about  0.03), 
analogous  to  the  one-sided  P-value  in  the  GLMM  analysis  (which  was  0.01). 

With  the  various  priors,  the  mean  of  the  posterior  distribution  of  the  variance  of  the  log 
odds  ratio  varied  between  0.49  and  1.03,  reflecting  potentially  considerable  heterogeneity 
among  centers  in  the  true  effect.  For  a  new  center  for  the  study  summarized  by  Table 
13.7, the predictive density for the log odds ratio was considerably wider than the posterior
density  for  the  mean  log  odds  ratio.  With  a  typical  prior,  it  gave  probability  0.19  of  a 
negative  value,  the  relatively  large  value  reflecting  heterogeneity  among  centers. 
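
A rough calculation shows why the predictive probability of a negative effect is so much larger than the posterior probability of a negative mean effect. The sketch below is not the Skene and Wakefield computation: it assumes normal approximations, treats the heterogeneity variance as known, and uses illustrative inputs picked from inside the ranges reported above.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def prob_negative(mean, sd):
    """P(X < 0) for X ~ N(mean, sd^2)."""
    return normal_cdf(-mean / sd)

# Illustrative values (assumptions), chosen within the reported ranges.
posterior_mean = 0.9           # mean of the posterior for the mean log odds ratio
posterior_sd = 0.45            # its posterior standard deviation
heterogeneity_variance = 0.7   # variance of the log odds ratio among centers

# For a new exchangeable center, uncertainty about the mean and
# center-to-center variability both contribute to the spread.
predictive_sd = math.sqrt(posterior_sd ** 2 + heterogeneity_variance)

print(prob_negative(posterior_mean, posterior_sd))
print(prob_negative(posterior_mean, predictive_sd))
```

With these inputs the two probabilities come out near 0.02 and 0.17, the same order of magnitude as the reported 0.03 and 0.19, although the approximation ignores the uncertainty about the variance itself.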

13.7.4  Bayesian  GLMMs  and  Marginal  Models 

Bayesian  methods  have  been  used  to  approximate  ML  fitting  of  GLMMs.  Use  of  a  flat 
prior  distribution  yields  a  posterior  density  that  is  a  constant  multiple  of  the  likelihood 
function. Then, Markov chain Monte Carlo (MCMC) methods for approximating intractable
posterior distributions can approximate the likelihood function (Zeger and Karim 1991).
An  approximation  for  the  mode  of  the  posterior  distribution  approximates  the  ML  estimate. 

A  danger  is  that  improper  prior  distributions  have  improper  posteriors  for  many  models 
for  categorical  data  (Natarajan  and  McCulloch  1995).  In  using  MCMC,  we  may  fail  to 
realize  that  the  posterior  is  improper.  It  is  safer  to  use  a  proper  but  relatively  diffuse  prior. 
With  a  multivariate  normal  prior,  we  could  also  use  a  hierarchical  approach  in  which  the 
covariance  matrix  has  an  inverse  Wishart  distribution.  However,  the  posterior  mode  need 
not  be  close  to  the  ML  estimate,  and  Markov  chains  may  converge  slowly. 

Of  course,  Bayesian  methods  can  be  used  not  only  to  approximate  frequentist  results 
but  also  as  a  standard  approach  for  those  who  prefer  the  Bayesian  paradigm,  whether  it 
be  for  estimating  population-averaged  or  cluster-specific  effects.  For  example,  Daniels  and 
Gatsonis  (1999)  used  multilevel  GLMs  to  analyze  geographic  and  temporal  trends  with 
clustered  longitudinal  binary  data.  This  built  on  hierarchical  modeling  ideas  introduced  by 
Wong  and  Mason  (1985). 

We’ve  seen  that  logistic  regression  does  not  extend  easily  to  the  modeling  of  multivariate 
categorical responses, because of a lack of a simple logistic analog of the multivariate normal.
However, O’Brien and Dunson (2004) formulated a multivariate logistic distribution
incorporating correlation parameters and having marginal logistic distributions. They used
this in a Bayesian analysis of marginal logistic regression models, showing that proper posterior
distributions typically exist even with an improper uniform prior for the regression
parameters. 

For modeling multivariate correlated ordinal responses, Chib and Greenberg (1998) used
a  multivariate  probit  model.  A  multivariate  normal  latent  random  vector  with  cutpoints 
along  the  real  line  defines  the  categories  of  the  observed  discrete  variables.  The  correlation 
among  the  categorical  responses  is  induced  through  the  covariance  matrix  for  the  underlying 
latent  variables.  Webb  and  Forster  (2008)  parameterized  the  model  in  such  a  way  that 
conditional  posterior  distributions  are  standard  and  easily  simulated.  They  focused  on 
model  determination  through  comparing  posterior  marginal  probabilities  of  the  model 
given  the  data  (integrating  out  the  parameters). 
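
The latent-variable construction behind the multivariate probit model can be sketched by simulation. The example below uses two responses with common cutpoints; the correlation, the cutpoints, and the function names are hypothetical, and no model fitting is attempted.

```python
import math
import random

def ordinal_from_latent(z, cutpoints):
    """Category index (0, 1, ...) equal to the number of cutpoints below z."""
    return sum(z > c for c in cutpoints)

def simulate_bivariate_ordinal(n, rho, cutpoints, rng):
    """Draw n pairs of ordinal responses from correlated standard normal latent variables."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has unit variance and correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((ordinal_from_latent(z1, cutpoints),
                      ordinal_from_latent(z2, cutpoints)))
    return pairs

def agreement(pairs):
    """Proportion of pairs in which both responses fall in the same category."""
    return sum(a == b for a, b in pairs) / len(pairs)

rng = random.Random(11)
cutpoints = (-0.5, 0.5)  # three ordered categories
print(agreement(simulate_bivariate_ordinal(5000, 0.8, cutpoints, rng)))
print(agreement(simulate_bivariate_ordinal(5000, 0.0, cutpoints, rng)))
```

The observed categories are associated only because the latent normals are correlated, which is how the covariance matrix of the latent vector induces the dependence among the categorical responses.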

Chen and Shao (1999) briefly reviewed other Bayesian approaches to handling such data.
They  employed  a  scale  mixture  of  multivariate  normal  links,  a  class  of  models  that  includes 
the  multivariate  probit,  t  link,  and  logit.  Chen  and  Shao  offered  both  a  noninformative  and 
an  informative  prior  and  gave  conditions  ensuring  that  the  posterior  is  proper.  Note  13.13 
lists  other  references  dealing  with  Bayesian  multivariate  categorical  data  analysis. 


NOTES 

Section  13.1:  Random  Effects  Modeling  of  Clustered  Categorical  Data 
13.1  Rasch,  clustered  binary  references:  For  further  discussion  of  the  Rasch  model  and  ways 
of  estimating  its  parameters,  see  Andersen  (1980,  Sec.  6.4)  and  Fischer  and  Molenaar 
(1995).  Haberman  (1977b)  showed  that  ML  estimators  can  achieve  consistency  when  both 
n  and  T  grow  at  suitable  rates.  Early  work  on  GLMMs  for  a  categorical  response  includes 
Anderson  and  Aitkin  (1985),  Bartholomew  (1980),  Bock  and  Aitkin  (1981),  Chamberlain 
(1980),  Gilmour  et  al.  (1985),  Pierce  and  Sands  (1975),  and  Stiratelli  et  al.  (1984).  Caffo 
and  Griswold  (2006)  and  Caffo  et  al.  (2007)  discussed  probit  random  effects  models  and 
related  models  using  the  t  link.  Hedeker  and  Gibbons  (2006),  Molenberghs  and  Verbeke 
(2005), Neuhaus (1992), and Pendergast et al. (1996) surveyed methods for clustered binary
data, including GLMMs and marginal models. McCullagh (2008) argued that most natural
sampling schemes involving binary random effects models are biased, an implication being
that the effects for such models and corresponding marginal models are not necessarily the
relevant  effects. 

13.2 Conditional ML versus random effects: In models with covariates, Neuhaus and Lesperance
(1996) noted that conditional ML may lose efficiency compared with the random effects
approach when cluster sizes are small and covariates have strong positive within-cluster correlation.
As that correlation approaches +1, the covariate effect resembles a between-cluster
one, which the conditional ML approach cannot estimate. The matched-pairs case referred
to in Section 13.1.2 in which the conditional ML estimate equals the random effects estimate
has within-cluster covariate correlation equal to $-1$ because, depending on the order of viewing the
observations, $x_t$ changes from 0 to 1 or from 1 to 0; then, no efficiency loss occurs.

13.3  Nonnormal  random  effects:  Alternatives  to  the  normal  random  effects  distribution  are 
conjugate  random  effects,  a  mixture  of  normals,  and  a  combination  of  conjugate  random 
effects  and  normal  random  effects.  See  Caffo  et  al.  (2007),  Molenberghs  et  al.  (2010),  and 
Lee et al. (2006). Wang and Louis (2003) and Parzen et al. (2011) showed that when the
random  effects  in  a  logistic  model  have  a  certain  scale  mixture  of  normal  distributions,  the 
marginal  model  also  has  logistic  form. 


Section  13.3:  Examples  of  Random  Effects  Models  for  Binary  Data 

13.4 Capture-recapture, heterogeneity: For other analyses permitting heterogeneous odds ratios
in several 2×2 tables, see Casella and Moreno (2005), Efron (1996), Liu and Pierce
(1993), and Skene and Wakefield (1990). For further discussion of capture-recapture modeling,
see Bishop et al. (1975, Chap. 6), Chao et al. (2001), Cormack (1989), Coull and Agresti
(1999), Darroch et al. (1993), Fienberg et al. (1999), Hook and Regal (1995), Pledger et
al. (2010), Royle et al. (2007), and the many references in these articles. Similarities exist
between this problem and the related problem of estimating the binomial index $n$ when
observing independent bin($n$, $\pi$) counts with unknown $n$ and $\pi$; see DasGupta and Rubin
(2005) and Grevstad (2006) and references in those articles. Relatively flat log likelihoods
also occur with other models that permit capture heterogeneity (Burnham and Overton 1978),
such  as  a  beta-binomial  model. 

13.5  Meta-analysis:  For  alternative  random  effects  approaches  to  meta-analysis,  see  Burr  and 
Doss  (2005),  Efron  (1996),  Emerson  et  al.  (1993),  Rucker  et  al.  (2009),  Shuster  (2010), 
Stijnen  et  al.  (2010),  and  Tian  et  al.  (2009). 

13.6  Ecological  inference:  King  (1997)  used  random  effects  models  as  part  of  a  solution  for 
analyzing  aggregated  categorical  data,  the  problem  of  ecological  inference.  Chambers  and 
Steel (2001) discussed early work by Leo Goodman on this problem and proposed a simpler
semiparametric  approach.  See  also  Wakefield  (2004). 

13.7 Joint response models: For longitudinal bivariate binary responses, Ten Have and Morabia
(1999)  simultaneously  modeled  bivariate  log  odds  ratios  and  univariate  logits.  Multivariate 
responses  sometimes  have  both  continuous  and  categorical  components.  For  random  effects 
modeling  of  such  data,  see  Catalano  and  Ryan  (1992),  Gueorguieva  and  Agresti  (2001), 
and Molenberghs and Verbeke (2005, Chap. 24). See Gueorguieva (2001) for a multivariate
generalization. 

13.8  Spatial  data:  For  examples  of  random  effects  models  for  spatial  categorical  response  data, 
see  Banerjee  et  al.  (2004),  Heagerty  and  Lele  (1998),  Hoeting  et  al.  (2000),  Kneib  and 
Fahrmeir  (2006),  and  Miller  and  Franklin  (2002). 


Section  13.4:  Random  Effects  Models  for  Multinomial  Data 


13.9 Ordinal response: The same predictor structure as in (13.18) holds with other links for which
a common effect for each logit is plausible, such as adjacent-categories logits (Hartzel et al.
2001a,b). With the complementary log-log link, the likelihood function has closed form
with a log gamma random effects distribution (Crouchley 1995, Ten Have 1996). Agresti and
Natarajan (2001), Hedeker and Gibbons (1994; 2006, Chap. 10), Hartzel et al. (2001a,b),
Hedeker  (2008),  and  Tutz  and  Hennevogl  (1996)  presented  ordinal  response  models  with 
random  effects. 

13.10 Nominal response: For multinomial extensions of the Rasch model, see Andersen (1980, pp.
272-284; 1995) and Conaway (1989). Daniels and Gatsonis (1997), Hartzel et al. (2001b),
Hedeker (2008), and Hedeker and Gibbons (2006, Chap. 11) presented nominal-response
models  with  random  effects.  For  discrete  choice  models  (Section  8.5)  with  random  effects, 
see  Chen  and  Kuo  (2001),  McFadden  and  Train  (2000),  Natarajan  et  al.  (2000),  and  Train 
(2009). 

Section  13.5:  Multilevel  Models 

13.11  Multilevel  references:  Early  work  on  multilevel  modeling  for  categorical  data  includes 
Aitkin  et  al.  (1981),  Anderson  and  Aitkin  (1985),  and  Wong  and  Mason  (1985).  For  later 
work,  see  Browne  et  al.  (2005),  Carlin  et  al.  (2001),  Daniels  and  Gatsonis  (1997,  1999), 
Gelman  and  Hill  (2006,  Ch.  14,  15),  Gibbons  and  Hedeker  (1997),  Goldstein  (2010),  Guo 
and  Zhao  (2000),  Heagerty  and  Zeger  (2000)  for  a  marginal  approach,  Hedeker  (2008)  for 
a survey for nominal and ordinal data, Longford (1993), Skrondal and Rabe-Hesketh (2003,
2004),  and  Vermunt  (2003)  for  multilevel  latent  class  models,  and  Yang  et  al.  (2000). 

Section  13.6:  GLMM  Fitting,  Inference,  and  Prediction 

13.12  Marginally  specified  model:  A  GLMM  determines  the  marginal  relationship  (averaged 
over  random  effects)  between  the  mean  response  and  explanatory  variables.  Conversely, 
Heagerty  (1999)  noted  that  a  marginal  model  for  the  mean  implicitly  determines  the  form  of 
the fixed portion of the linear predictor in a random effects model. The GLMM (13.1) has
linear predictor $x_{it}^T \beta + z_{it}^T u_i$. A more general form $\Delta_{it} + z_{it}^T u_i$ corresponds to a particular
marginal model. Here, $\Delta_{it}$ is a function of the marginal linear predictor and the random
effects distribution. It is implicitly defined by the integral equation that links the marginal
and conditional means. Caffo et al. (2007) gave related discussion, and Swihart et al. (2012)
discussed  equivalent  copula  models. 

Section  13.7:  Bayesian  Multivariate  Categorical  Modeling 

13.13  Bayes  multivariate:  Dey  et  al.  (2000)  edited  a  collection  of  articles  that  provided  Bayesian 
analyses  for  GLMs,  often  in  a  multivariate  setting.  For  instance,  in  that  volume  Gelfand 
and Ghosh surveyed the subject, Albert and Ghosh reviewed item-response modeling, Chib
modeled  correlated  binary  data,  Chen  and  Dey  modeled  correlated  ordinal  data,  and  Landrum 
and  Normand  gave  a  case  study  using  Bayesian  ordinal  probit  and  logit  models. 


EXERCISES 

Applications 

13.1 Refer to the heaven/hell matched-pairs data of Table 11.14 and Exercise 11.8.

a. Fit the random intercept model (13.3). Interpret $\hat{\beta}$.

b. Compare $\hat{\beta}$ and its SE for this approach to the conditional ML approach.

c.  Refer  to  the  two  logistic  models  used  in  Exercise  11.8.  Explain  why  the 
population-averaged and subject-specific effects differ so much for these data.

13.2 Refer to Table 4.11 on the three-point shooting of Ray Allen. In game $i$, suppose
that $y_i$ = number made out of $n_i$ attempts is a bin($n_i$, $\pi_i$) variate and $\{y_i\}$ are
independent.

a. Fit the model, $\text{logit}(\pi_i) = \alpha$. Find and interpret $\hat{\pi}_i$. Does the model appear to fit
adequately?

b. Fit the model, $\text{logit}(\pi_i) = \alpha + u_i$, where $\{u_i\}$ are independent $N(0, \sigma^2)$. Use
$\hat{\alpha}$ and $\hat{\sigma}$ to summarize Allen’s shooting. Is there evidence that this model fits
better than the one in part (a)?

13.3 For Table 9.3, let $y_{it} = 1$ when subject $i$ used substance $t$. Table 13.14 shows output
for the logistic-normal model

$$\text{logit}[P(Y_{it} = 1 \mid u_i)] = \beta_t + u_i.$$

Interpret $\hat{\sigma}$ and the effect that compares use of cigarettes ($t = 2$) and marijuana
($t = 3$). How is the focus different from that for the loglinear model (AC, AM, CM)
used in Section 9.2.4? If $\hat{\sigma} = 0$, which loglinear model would have the same fit as
this GLMM?


Table 13.14  Output for Exercise 13.3 on Cigarettes, Alcohol, and Marijuana Use

Subjects               2276    Parameter   Estimate   Std Error   t Value
Max Obs Per Subject       3    beta1         4.2227      0.1824     23.15
Parameters                4    beta2         1.6209      0.1207     13.43
Quadrature Points       200    beta3        -0.7751      0.1061     -7.31
Log Likelihood        -3311    sigma         3.5496      0.1627     21.82

13.4  For  the  student  survey  data  in  Table  10.1,  (a)  analyze  using  GLMMs,  and  (b) 
compare  results  and  interpretations  to  those  with  marginal  models  in  Exercise  12.2. 

13.5 Consider model (13.11) for the attitudes toward abortion data in Table 13.3.

a. Fit the model. If your software uses Gauss-Hermite numerical integration,
report $\{\hat{\beta}_t\}$ and their standard errors for 5, 25, 100, and 500 quadrature points,
and comment on convergence.

b. Under the constraint $\sigma = 0$, explain why the fit is the same as (i) an ordinary
logistic model treating the three responses for each subject as if they were independent
responses for three separate subjects, (ii) an ordinary loglinear model
($GS_1$, $GS_2$, $GS_3$) of mutual independence of responses in the three situations
($S_1$, $S_2$, $S_3$), given $G$ = gender.

c. Fit one of the models in (b). Interpret, and explain why $\{\hat{\beta}_t - \hat{\beta}_u\}$ are quite
different from the estimates in Section 13.3.2 for the model allowing $\sigma > 0$.

13.6 Consider the crossover study in Table 12.9 (Exercise 12.6).

a. Fit the model

$$\text{logit}[P(Y_{i(k)t} = 1 \mid u_{i(k)})] = \alpha_k + \beta_t + u_{i(k)}, \qquad (13.22)$$

where $\{u_{i(k)}\}$ are independent $N(0, \sigma^2)$. Interpret $\{\hat{\beta}_t\}$ and $\hat{\sigma}$.

b. We can also add period or carryover effects. Add two period effects to model
(13.22) (e.g., the first-period-effect parameter adds to the model when $t = A$
and $k = 1, 2$, $t = B$ and $k = 3, 4$, and $t = C$ and $k = 5, 6$). Check whether the
fit improves. Interpret.

c. For the model in (a), compare estimates of $\beta_B - \beta_A$ and $\beta_C - \beta_A$ and SE values
to those using (i) a marginal model, and (ii) conditional logistic regression,
treating subject terms in model (13.22) as fixed effects.

13.7 For Table 6.6 on admissions decisions for graduate school applicants, let $y_{ig} = 1$
when a subject in department $i$ of gender $g$ (1 = females, 0 = males) is admitted.

a. For the fixed effects model, $\text{logit}[P(Y_{ig} = 1)] = \alpha + \beta g + \beta_i^D$, $\hat{\beta} = 0.173$
(SE = 0.112). The corresponding model (13.14) in which departments are a
normal random effect has $\hat{\beta} = 0.163$ (SE = 0.111). Interpret these.

b. The model of form (13.14) allowing the gender effect to vary by department has
$\hat{\beta} = 0.176$ (SE = 0.132), with $\hat{\sigma}_b = 0.20$. Interpret. Explain why the standard
error of $\hat{\beta}$ is larger than with the other analyses.

c. The sample conditional odds ratios between gender and whether admitted vary
between 0 and $\infty$. By contrast, predicted odds ratios for the interaction random
effects model do not vary much. Explain why.

13.8 For the clinical trial in Table 6.11, let $\pi_{it} = P(Y_{it} = 1 \mid u_i)$ denote the probability of
success for treatment $t$ in center $i$.

a. The random intercept model (13.13) has $\hat{\beta} = 1.52$ (SE = 0.70) and $\hat{\sigma} = 1.9$.
Interpret.

b. From Section 6.5.2, the fixed effects analog of this model (replacing $\alpha + u_i$ by
$\alpha_i$) has $\hat{\alpha}_1 = \hat{\alpha}_3 = -\infty$, corresponding to $\hat{\pi}_{1t} = \hat{\pi}_{3t} = 0$ for each treatment. By
contrast, the random effects model has $\hat{\alpha} + \hat{u}_1 = -3.78$ (using NLMIXED in
SAS) and $\hat{\pi}_{11} = 0.047$ and $\hat{\pi}_{12} = 0.011$ in center 1. Explain how this model can
have $\hat{\pi}_{it} > 0$ in centers having no successes.

13.9 For the subject-specific model in Section 13.3.3 for the depression study, verify that
the estimated difference in time effect slopes between the new and standard drugs
for treating depression is (a) 1.018 (SE = 0.192) with the GLMM approach, and
(b) 1.156 (SE = 0.222) with conditional ML.

13.10 For marginal model (11.16) for Table 11.6 on premarital and extramarital sex, Table
13.15 shows results of fitting a corresponding random intercept model. Interpret $\hat\beta$.
Why is the estimate so different from $\hat\beta$ in Section 11.3.4 for the marginal model?


Table 13.15 Output for Exercise 13.10 on GSS Items

Subjects                  1337
Max Obs Per Subject          2
Parameters                   5
Quadrature Points          100
-2 Log Likelihood       5000.5

Parameter   Estimate   Std Error   t Value
inter1       -1.9702      0.1164    -16.93
inter2        0.9840      0.0659     14.93
inter3        1.1737      0.0761     15.43
beta          4.2387      0.1975     21.46
sigma         1.8612      0.1422     13.09
13.11 Landis and Koch (1977) showed ratings by seven pathologists who separately
classified 118 slides regarding the presence and extent of carcinoma of the uterine
cervix, using a five-point ordinal scale. (Table 14.1 is a collapsing of their table
that combines the first two categories and the last three categories.) For slide $i$ with
rater $t$, the model

$$\text{logit}[P(Y_{it} \le j \mid u_i)] = \alpha_j + \beta_t + u_i$$

fitted (with $\beta_7 = 0$) assuming that $\{u_i\}$ are independent $N(0, \sigma^2)$ has $\hat\beta_6 = 2.907$
(SE = 0.344) and $\hat\sigma = 3.8$. The corresponding marginal model, fitted using
independence working correlations, has GEE estimate $\hat\beta_6 = 1.252$ (SE = 0.161).
Interpret $\hat\beta_6$ for each model. Explain why $\hat\beta_6$ for the GLMM is much larger in
absolute value. Discuss the differences in assumptions and interpretations for the
two models.

13.12 From Section 11.4.2, the ML estimates of main effects in the quasi-symmetry model
relate to conditional ML estimates for a subject-specific model using baseline-
category logits. For the migration data in Table 11.7, $\hat\lambda_1^X - \hat\lambda_1^Y = 1.74$ when
constraints set $\hat\lambda_4^X = \hat\lambda_4^Y = 0$. For a given subject, the estimated odds of living in the
Northeast instead of the West at age 16 were exp(1.74) = 5.70 times the odds in
2010. Explain why the corresponding population-averaged odds ratio estimate is
[(370/341)/(291/391)] = 1.46, and explain how the estimates can differ so much.

13.13 Refer to Section 13.3.8 on boys' attitudes toward the leading crowd. Table 13.16
shows results for a sample of schoolgirls. Fit model (13.16) and interpret.
Summarize the estimated variability and correlation of random effects.

Table 13.16 Data for Exercise 13.13 on Girls and Leading Crowd

(M, A) for                      (M, A) for Second Interview^a
First Interview   (Yes, Positive)  (Yes, Negative)  (No, Positive)  (No, Negative)
Yes, positive           484               93              107              32
Yes, negative           112              110               30              46
No, positive            129               40              768             321
No, negative             74               75              303             536

^a M, membership; A, attitude.

Source: J. S. Coleman, Introduction to Mathematical Sociology. London: Free Press of Glencoe, 1964, p. 168.


13.14  Generalize  model  (13.16)  to  apply  simultaneously  to  Table  13.8  for  boys  and  Table 
13.16  for  girls,  using  a  gender  main  effect  but  the  same  membership  effect  and 
the  same  attitude  effect  for  each  gender.  Fit  the  model.  Use  the  maximized  log 
likelihood  to  compare  with  a  more  general  model  having  different  membership 
effects  and  different  attitude  effects  for  each  gender.  Interpret. 

13.15 Table 13.17 reports results from a study to estimate the number N of people infected
during a 1995 hepatitis A outbreak in Taiwan. The 271 observed cases were reported
from records based on a serum test taken by the Institute of Preventive Medicine of
Taiwan (P), records reported by the National Quarantine Service (Q), and records
based on questionnaires administered by epidemiologists (E).

Table 13.17 Data for Exercise 13.15 on Hepatitis Infections

P  Q  E   Observed Count   Logistic-Normal ML Fit
0  0  0         —               (487, ∞)
0  0  1        63                 61.0
0  1  0        55                 58.0
0  1  1        18                 17.0
1  0  0        69                 68.0
1  0  1        17                 20.0
1  1  0        21                 19.0
1  1  1        28                 28.0

Source: Data from Chao et al. (2001).

a. Using the model of mutual independence with P, Q, and E, find $\hat N$ and a 95%
profile likelihood interval for $N$.

b. The random effects model of Section 13.3.4 has fit shown in Table 13.17, for
which $\hat\sigma = 2.9$. The log likelihood is relatively flat, and $\hat N = 4551$ with a 95%
profile likelihood interval of (758, $\infty$) (Coull and Agresti 1999). Since the
interval in part (a) is much narrower, is it necessarily more reliable? Explain.

13.16 Analyze the crossover data of Table 11.22 using a random effects model. Interpret.

13.17 The analyses in Section 13.3.5 describing heterogeneity in multicenter clinical trials
extend to ordinal responses. Using random effects models, analyze the 2 × 3 × 8
table in Hartzel et al. (2001a), shown also at the text website.

13.18  Exercises  6.18  and  6.19  referred  to  published  meta-analyses.  For  one  of  these, 
conduct  a  meta-analysis  that  uses  methods  of  this  chapter.  Interpret. 

13.19  For  the  example  in  Section  13.4.6,  in  interpreting  effects,  Hedeker  (2008)  reported 
sample  proportions  in  each  response  category  at  each  time  for  each  group.  He  noted 
that  over  time,  (a)  there  was  a  general  decrease  in  street  living  and  an  increase  in 
independent  living  for  both  groups,  (b)  the  increase  in  independent  living  occurs 
sooner  for  the  certificate  group  than  the  control  group,  (c)  regarding  community 
living,  this  increases  for  the  control  group  and  decreases  for  the  certificate  group. 
Explain  how  the  estimates  in  Table  13.12  suggest  these  interpretations. 

13.20 Refer to Exercise 12.19 and the data for a clinical trial for toenail infection.

a.  Fit  a  logistic-normal  random  intercept  model  for  the  binary  endpoint.  Discuss 
how  the  treatment  effect  estimate  at  baseline  and  its  SE  depend  on  the  fitting 
method  you  use  (e.g.,  on  the  number  of  quadrature  points). 

b.  Compare  results  to  those  of  a  marginal  model  analysis  for  the  data. 
13.21 Analyze Table 12.8 with age and maternal smoking as predictors using (a) a logistic-
normal model, (b) a marginal model, and (c) a transitional model. Explain how the
interpretation of the maternal smoking effect differs for the three approaches.

13.22  For  Exercise  6.29  about  a  meta-analysis  on  the  effect  of  rosiglitazone  on  myocardial 
infarction,  conduct  a  fully  Bayesian  analysis.  Justify  the  choice  of  priors.  Compare 
results  and  interpretations  to  the  fixed  effects  analysis. 

Theory  and  Methods 

13.23 For the voting example in Section 13.3.1, using supplementary information
improves predictions. Let $q_i$ denote the true proportion of votes for Kerry (the
Democratic candidate) in state $i$ in the 2004 election. Consider the model

$$\text{logit}[P(y_{it} = 1 \mid u_i)] = \text{logit}(q_i) + \alpha + u_i,$$

where $\{q_i\}$ are known and $\{u_i\}$ are independent $N(0, \sigma^2)$. When $\sigma = 0$, show
$\pi_i = q_i \exp(\alpha)/[1 - q_i + q_i \exp(\alpha)]$. Compared to $\{q_i\}$, explain how $\pi_i$ then shifts
up or down depending on how the overall Democratic vote compares in the current
poll to the previous election (i.e., depending on $\alpha$). When also $\alpha = 0$, show $\pi_i = q_i$.

13.24 For a binary response, consider the random effects model

$$\text{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + u_i, \quad t = 1, \ldots, T,$$

where $\{u_i\}$ are independent $N(0, \sigma^2)$, and the marginal model

$$\text{logit}[P(Y_t = 1)] = \alpha^* + \beta_t^*, \quad t = 1, \ldots, T.$$

For identifiability, $\beta_T = \beta_T^* = 0$. Explain why all $\beta_t = 0$ implies that all $\beta_t^* = 0$.
Is the converse true?

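13.25 The GLMM for binary data using probit link function is

$$\Phi^{-1}[P(Y_{it} = 1 \mid \mathbf{u}_i)] = \mathbf{x}_{it}^\top \boldsymbol{\beta} + \mathbf{z}_{it}^\top \mathbf{u}_i,$$

where $\Phi$ is the $N(0, 1)$ cdf and $\mathbf{u}_i$ has $N(\mathbf{0}, \boldsymbol{\Sigma})$ pdf, $f(\mathbf{u}_i; \boldsymbol{\Sigma})$.

a. Show that the marginal mean is

$$P(Y_{it} = 1) = \int P(Z - \mathbf{z}_{it}^\top \mathbf{u}_i \le \mathbf{x}_{it}^\top \boldsymbol{\beta})\, f(\mathbf{u}_i; \boldsymbol{\Sigma})\, d\mathbf{u}_i,$$

where $Z$ is a standard normal variate that is independent of $\mathbf{u}_i$.

b. Since $Z - \mathbf{z}_{it}^\top \mathbf{u}_i$ has a $N(0, 1 + \mathbf{z}_{it}^\top \boldsymbol{\Sigma} \mathbf{z}_{it})$ distribution, deduce that

$$\Phi^{-1}[P(Y_{it} = 1)] = \mathbf{x}_{it}^\top \boldsymbol{\beta}\, [1 + \mathbf{z}_{it}^\top \boldsymbol{\Sigma} \mathbf{z}_{it}]^{-1/2}.$$

Hence, the marginal model is a probit model with attenuated effect. In the
univariate random intercept case, show that the marginal effect equals that from
the GLMM divided by $\sqrt{1 + \sigma^2}$.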

13.26 In the Rasch model, $\text{logit}[P(Y_{it} = 1)] = \alpha_i + \beta_t$, treat $\alpha_i$ as a fixed effect.

a. Assuming independence of responses for different subjects and for different
observations on the same subject, show that the log likelihood is

$$\sum_i \sum_t \alpha_i y_{it} + \sum_i \sum_t \beta_t y_{it} - \sum_i \sum_t \log[1 + \exp(\alpha_i + \beta_t)].$$

b. Show that the likelihood equations are $y_{+t} = \sum_i P(Y_{it} = 1)$ and $y_{i+} =
\sum_t P(Y_{it} = 1)$ for all $i$ and $t$. Explain why conditioning on $\{y_{i+}\}$ yields a
distribution that does not depend on $\{\alpha_i\}$.

13.27 Consider the matched-pairs random effects model (13.3). For given $\beta_0$, let $\delta_0$ be
such that $\hat\mu_{12} = n_{12} + \delta_0$ and $\hat\mu_{21} = n_{21} - \delta_0$ satisfy $\log(\hat\mu_{21}/\hat\mu_{12}) = \beta_0$. Suppose
$\{\hat\mu_{ij}\}$ has nonnegative log odds ratio. Explain why:

a. This is the fit of the model assuming $\beta = \beta_0$.

b. The likelihood-ratio statistic for testing $H_0\colon \beta = \beta_0$ in this model equals

$$2\left[n_{12} \log\left(\frac{n_{12}}{n_{12} + \delta_0}\right) + n_{21} \log\left(\frac{n_{21}}{n_{21} - \delta_0}\right)\right].$$

c. The likelihood-ratio test of $H_0\colon \beta = 0$ is the test of symmetry.

13.28 Explain why the logistic-normal model is not helpful for capture-recapture
experiments with only two captures.

13.29  In  recent  U.S.  Presidential  elections,  in  each  state  more  wealthy  voters  tend  to 
be  more  likely  to  vote  Republican,  yet  states  that  are  wealthier  in  an  aggregate 
sense  are  more  likely  to  go  Democrat  for  the  electoral  college.  Sketch  a  plot  that 
illustrates  how  this  instance  of  Simpson's  paradox  could  occur.  Specify  a  GLMM 
with  random  effects  for  states  that  could  be  used  to  analyze  data  for  a  sample  of 
voters  using  their  state  of  residence,  their  household  income,  and  their  vote  in  an 
election.  Explain  how  the  model  could  be  generalized  to  allow  the  income  effect 
to  vary  by  state.  [For  details,  see  Gelman  and  Hill  (2007,  Sec.  14.2).] 

13.30 Summarize advantages and disadvantages of using a GLMM approach compared
with a marginal model approach. Describe conditions under which parameter
estimators are consistent for (a) marginal models using GEE, (b) marginal models
using ML, and (c) GLMMs using ML.


CHAPTER  14 


Other  Mixture  Models  for  Discrete  Data 


In Chapters 11 through 13 we introduced methods for observations that are correlated due to
repeated  measurement  and  other  forms  of  clustering.  The  generalized  linear  mixed  models 
(GLMMs)  of  Chapter  13  assume  normal  random  effects.  They  describe  heterogeneity  by 
replacing  the  linear  predictor  by  a  normally  distributed  mixture  of  linear  predictors.  In  this 
chapter  we  present  GLMM-type  models  that,  except  for  one  case,  use  nonnormal  mixture 
distributions. 

In  Section  14.1  we  present  latent  class  models.  These  treat  a  contingency  table  as  a 
finite  mixture  of  unobserved  tables  generated  under  a  conditional  independence  structure  at 
categories  of  a  latent  variable.  In  Section  14.2  we  present  a  related  nonparametric  approach 
to  fitting  GLMMs  that  uses  an  unspecified  discrete  quantitative  distribution  for  the  random 
effects  distribution. 

In  Section  14.3  we  present  models  for  clustered  binomial  responses  that  use  the  beta 
distribution  to  describe  heterogeneity  of  binomial  parameters.  The  resulting  beta-binomial 
distribution  has  variance  function  for  which  quasi-likelihood  methods  are  also  available. 
In Section 14.4 we model count responses using the gamma distribution to describe
heterogeneity of Poisson parameters. The resulting negative binomial regression model
corresponds to a Poisson GLMM having a log-gamma distributed random effect. It is an
alternative to the GLMM for Poisson responses with normal random effects, a model
presented  in  Section  14.5. 

14.1  LATENT  CLASS  MODELS 

Ordinary  GLMMs  create  a  mixture  of  linear  predictor  values  using  a  latent  variable,  the 
unobserved  random  effect  vector,  that  is  assumed  to  have  a  normal  distribution.  By  contrast, 
latent  class  models  use  a  mixture  distribution  that  is  qualitative  rather  than  quantitative. 
The  basic  model  assumes  existence  of  a  latent  categorical  variable  such  that  the  observed 
response  variables  are  conditionally  independent,  given  that  variable. 
Figure 14.1 Conditional independence graph for latent class model.


14.1.1  Independence  Given  a  Latent  Categorical  Variable 

For categorical response variables $(Y_1, Y_2, \ldots, Y_T)$, the latent class model assumes a
latent categorical variable $Z$ such that for each possible sequence of response outcomes
$(y_1, \ldots, y_T)$ and each category $z$ of $Z$,

$$P(Y_1 = y_1, \ldots, Y_T = y_T \mid Z = z) = P(Y_1 = y_1 \mid Z = z) \cdots P(Y_T = y_T \mid Z = z).$$


The  model  was  introduced  by  Lazarsfeld  in  1950  and  described  by  Lazarsfeld  and  Henry 
(1968).  Figure  14.1  shows  the  conditional  independence  graph  for  the  model.  A  latent  class 
model summarizes probabilities of classification $P(Z = z)$ in the latent classes as well
as conditional probabilities $P(Y_t = y_t \mid Z = z)$ of outcomes for each $Y_t$ within each latent
class.  These  are  the  model  parameters.  The  model  is  an  analog  for  categorical  responses 
and  latent  variables  of  the  factor  analysis  model  with  a  common  factor  for  multivariate 
normal  responses. 

The  latent  class  model  is  sometimes  plausible  when  the  observed  variables  are  several 
indicators  of  some  concept,  such  as  prejudice,  religiosity,  or  opinion  about  an  issue.  An 
example  is  Table  13.3,  in  which  subjects  gave  their  opinions  about  whether  abortion  should 
be  legal  in  various  situations.  Perhaps  an  underlying  latent  variable  describes  one’s  basic 
attitude  toward  legalized  abortion,  such  that  given  the  value  of  that  latent  variable,  responses 
on  the  observed  variables  are  conditionally  independent.  For  instance,  there  may  be  three 
latent  classes:  one  for  those  who  always  oppose  legalized  abortion  regardless  of  the  situation, 
one  for  those  who  always  support  it,  and  one  for  those  whose  response  depends  on  the 
situation. 

The $T$-dimensional contingency table cross-classifying $(Y_1, \ldots, Y_T)$ is observed. The
$(T + 1)$-dimensional table that cross-classifies it with the latent variable is an unobserved
table. Denote the number of categories of each $Y_t$ by $I$ and the number of latent classes of $Z$
by $q$. For the observed table, let $\pi_{y_1,\ldots,y_T} = P(Y_1 = y_1, \ldots, Y_T = y_T)$. The model assumes
a multinomial distribution over its $I^T$ cells. Each cell probability satisfies

$$\pi_{y_1,\ldots,y_T} = \sum_{z=1}^{q} P(Y_1 = y_1, \ldots, Y_T = y_T \mid Z = z)\, P(Z = z).$$

The conditional independence factorization for the latent class model states that

$$\pi_{y_1,\ldots,y_T} = \sum_{z=1}^{q} \left[\prod_{t=1}^{T} P(Y_t = y_t \mid Z = z)\right] P(Z = z). \tag{14.1}$$

This is a nonlinear model for the $I^T$ multinomial probabilities.
The latent class model implies that the loglinear model symbolized by $(Y_1Z,
Y_2Z, \ldots, Y_TZ)$ holds for the unobserved table. The model makes no assumption about
the $\{Y_tZ\}$ associations but assumes that the $\{Y_t\}$ are mutually independent within each
category of $Z$.

More generally, the latent variable can be multivariate. For example, for the membership
and attitude toward the "leading crowd" data that we analyzed in Section 13.3.8 with model
(13.16) with correlated normal random effects, Goodman (1974) used an analogous model
with two associated binary latent variables.

14.1.2  Fitting  Latent  Class  Models 

Denote the counts in the observed table by $\{n_{y_1,\ldots,y_T}\}$. For the $I^T$ cells in that table, the
kernel of the multinomial log likelihood is the sum over these cells,

$$\sum n_{y_1,\ldots,y_T} \log \pi_{y_1,\ldots,y_T}. \tag{14.2}$$

Substituting (14.1), we can maximize this with respect to the model parameters
$\{P(Y_t = y_t \mid Z = z)\}$ and $\{P(Z = z)\}$ using the EM algorithm (Goodman 1974) or the
Newton-Raphson algorithm (Haberman 1979, Chap. 10).

The EM algorithm has two steps in each iteration. The E (expectation) step in iteration $s$
calculates pseudo-counts $\{n^{(s)}_{y_1,\ldots,y_T,z}\}$ for the unobserved table using $\{n_{y_1,\ldots,y_T}\}$ and a working
conditional distribution for $(Z \mid Y_1, \ldots, Y_T)$ described shortly. The M (maximization) step
treats $\{n^{(s)}_{y_1,\ldots,y_T,z}\}$ as data and maximizes the pseudo-likelihood, fitting the loglinear model
$(Y_1Z, Y_2Z, \ldots, Y_TZ)$. The fit $\{\hat\mu^{(s)}_{y_1,\ldots,y_T,z}\}$ of that model in the unobserved table then
determines the new working conditional distribution of $(Z \mid Y_1, \ldots, Y_T)$ to apply to $\{n_{y_1,\ldots,y_T}\}$
for the E-step of the next iteration. This allocates the observed data to pseudo-counts in the
unobserved cells in proportion to this fit, using

$$n^{(s+1)}_{y_1,\ldots,y_T,z} = n_{y_1,\ldots,y_T} \left[\frac{\hat\mu^{(s)}_{y_1,\ldots,y_T,z}}{\sum_{k=1}^{q} \hat\mu^{(s)}_{y_1,\ldots,y_T,k}}\right].$$

These are entries in the unobserved table for iteration $(s + 1)$. They are used as pseudo-data
for the M-step of iteration $(s + 1)$. Eventually, the algorithm converges to fitted values
for the unobserved table that satisfy mutual independence within each latent class, and
such that the corresponding fitted probabilities in the observed table (i.e., added over the
latent categories) maximize the log likelihood (14.2). These fitted values also induce ML
estimates of the latent class model parameters $\{P(Y_t = y_t \mid Z = z)\}$ and $\{P(Z = z)\}$.

The  EM  algorithm  is  computationally  simple  and  stable.  Each  iteration  increases  the 
likelihood.  However,  its  convergence  can  be  slow.  A  more  problematic  issue  is  that  the  log 
likelihood  can  have  local  maxima.  With  either  the  EM  or  the  Newton-Raphson  algorithm, 
you  should  perform  the  fitting  process  a  few  times  with  different  starting  guesses  for  the 
parameter  values.  The  EM  algorithm  tends  to  be  less  sensitive  to  the  choice  of  starting 
values.  As  q  increases,  multiple  local  maxima  are  more  likely  and  the  danger  also  increases 
of  a  lack  of  identifiability. 
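
The E- and M-steps described above can be sketched for binary items as follows. This is a minimal illustration in Python with NumPy, not the procedure used by any particular package; the function names `pattern_class_probs` and `fit_latent_class_em`, their arguments, the random starting values, and the small example counts are all assumptions made for the sketch. Because the loglinear model fitted in the M-step is a conditional independence model, its fit has a closed form, so the sketch updates the parameters directly from the pseudo-counts.

```python
import numpy as np


def pattern_class_probs(patterns, class_probs, item_probs):
    """Return the (cells x classes) array of P(Y = pattern, Z = z)."""
    # Conditional independence: multiply item probabilities within each class.
    per_item = np.where(patterns[:, None, :] == 1,
                        item_probs[None, :, :],
                        1.0 - item_probs[None, :, :])
    return per_item.prod(axis=2) * class_probs[None, :]


def fit_latent_class_em(patterns, counts, q, n_iter=200, seed=0):
    """EM fit of a q-class latent class model for binary items.

    patterns : (cells, T) array of 0/1 response patterns
    counts   : (cells,) observed counts for those patterns
    Returns class_probs (q,), item_probs (q, T) holding P(Y_t = 1 | Z = z),
    and the log-likelihood kernel (14.2) evaluated at the start of each iteration.
    """
    patterns = np.asarray(patterns)
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    class_probs = np.full(q, 1.0 / q)
    item_probs = rng.uniform(0.2, 0.8, size=(q, patterns.shape[1]))
    loglik = []
    for _ in range(n_iter):
        joint = pattern_class_probs(patterns, class_probs, item_probs)
        loglik.append(float(counts @ np.log(joint.sum(axis=1))))
        # E-step: allocate each observed count to classes in proportion to the fit.
        pseudo = counts[:, None] * joint / joint.sum(axis=1, keepdims=True)
        # M-step: closed-form ML estimates from the pseudo-counts.
        class_totals = pseudo.sum(axis=0)
        class_probs = class_totals / class_totals.sum()
        item_probs = (pseudo.T @ patterns) / class_totals[:, None]
        # Keep estimates slightly off the boundary so the logs stay finite.
        item_probs = np.clip(item_probs, 1e-10, 1 - 1e-10)
    return class_probs, item_probs, loglik


# Several random starts, keeping the fit with the largest final log likelihood.
patterns = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
counts = np.array([40, 5, 6, 3, 4, 5, 7, 30])  # hypothetical counts
fits = [fit_latent_class_em(patterns, counts, q=2, seed=s) for s in range(5)]
best = max(fits, key=lambda fit: fit[2][-1])
```

Comparing the final log likelihoods across the starts in `fits` is one simple way to check for local maxima; if they differ noticeably, more starts are advisable.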

Standard errors for model parameter estimates result from inverting the model's
estimated information matrix. This is a by-product of the Newton-Raphson algorithm but not
the EM algorithm. One way to obtain standard errors with the EM algorithm applies a
useful formula of Louis (1982) for the observed information. It equals the expected value
of the observed information for the loglinear model for the unobserved table minus the
expected value of the information for the conditional distribution of Z given the observed
data. Lang (1992) gave related results.

Chi-squared statistics comparing observed cell counts to fitted values test the model fit.
The residual df $= I^T - qT(I - 1) - q$, since model (14.1) describes $I^T - 1$ multinomial
probabilities using $(I - 1)$ parameters $\{P(Y_t = y_t \mid Z = z),\ y_t = 1, \ldots, I - 1\}$ at each of $qT$
combinations of $z$ and $t$ values for the latent variable and the response indicator, and $q - 1$
parameters $\{P(Z = z)\}$. Often, the nature of the variables suggests a value for $q$, usually
quite small (2 to 4). Otherwise, you can start with $q = 2$ and, if the fit is inadequate,
increase $q$ in steps of 1 as long as the fit shows substantive improvement.
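
As a quick check of the residual df formula, the short sketch below counts the free cell probabilities and subtracts the two groups of parameters. The function name `latent_class_residual_df` is hypothetical. For seven binary ratings it gives the df values listed for the latent class models in Table 14.2.

```python
def latent_class_residual_df(n_categories, n_items, n_classes):
    """Residual df for a latent class model: I^T - qT(I - 1) - q."""
    free_cell_probs = n_categories ** n_items - 1          # I^T - 1
    item_params = n_classes * n_items * (n_categories - 1)  # qT(I - 1)
    class_params = n_classes - 1                            # q - 1
    return free_cell_probs - item_params - class_params


# Seven binary ratings (I = 2, T = 7) with q = 1, 2, 3, 4 latent classes.
dfs = [latent_class_residual_df(2, 7, q) for q in (1, 2, 3, 4)]
print(dfs)  # [120, 112, 104, 96]
```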

14.1.3  Example:  Latent  Class  Model  for  Rater  Agreement 

Table  14.1  shows  results  for  seven  pathologists  who  classified  each  of  118  slides  on 
the  presence  or  absence  of  carcinoma  in  the  uterine  cervix.  For  modeling  interobserver 
agreement,  the  conditional  independence  assumption  of  the  latent  class  model  is  often 
plausible.  With  a  blind  rating  scheme,  ratings  of  a  given  subject  or  unit  by  different 
pathologists  are  independent.  If  subjects  having  true  rating  in  a  given  category  are  relatively 
homogeneous,  then  ratings  by  different  pathologists  may  be  nearly  independent  within  a 
given true rating class. Thus, one might posit a latent class model with $q = 2$ classes, one


Table 14.1 Diagnoses of Carcinoma and Fits of Latent Class Models^a

       Pathologist                            Fit
A  B  C  D  E  F  G    Count    q = 1    q = 2    q = 3
0  0  0  0  0  0  0      34       1.1     23.0     33.8
0  0  0  0  1  0  0       2       1.6      6.6      2.0
0  1  0  0  0  0  0       6       2.2     12.7      6.3
0  1  0  0  0  0  1       1       2.8      1.7      1.5
0  1  0  0  1  0  0       4       3.3      3.6      3.0
0  1  0  0  1  0  1       5       4.2      0.5      4.7
1  0  0  0  0  0  0       2       1.4      3.0      2.1
1  0  1  0  1  0  1       1       1.6      0.2      0.2
1  1  0  0  0  0  0       2       2.8      1.7      1.3
1  1  0  0  0  0  1       1       3.5      0.3      1.6
1  1  0  0  1  0  0       2       4.2      0.5      2.9
1  1  0  0  1  0  1       7       5.3      3.7      6.5
1  1  0  0  1  1  1       1       1.4      2.6      1.4
1  1  0  1  0  0  1       1       1.3      0.1      0.1
1  1  0  1  1  0  1       2       2.0      4.3      2.6
1  1  0  1  1  1  1       3       0.5      3.1      2.0
1  1  1  0  1  0  1      13       3.3     11.5      9.6
1  1  1  0  1  1  1       5       0.9      8.4      8.7
1  1  1  1  1  0  1      10       1.2     13.5     13.6
1  1  1  1  1  1  1      16       0.3      9.9     12.3

^a Fits obtained with Latent Gold (Statistical Innovations, Belmont, MA). 1, yes; 0, no.
Source: Based on data in Landis and Koch (1977), not showing empty cells.
Table 14.2 Likelihood-Ratio Statistics for Latent Class Models Fitted to
Table 14.1^a

Number of                                          Deviance (G^2)
Latent Classes   Model                               Statistic       df
1                Mutual independence                   476.8         120
2                Latent class                           62.4         112
                 Rasch mixture                          67.6         118
3                Latent class                           15.3         104
                 Rasch mixture                          27.5         116
4                Latent class                            6.4          96
                 Rasch mixture (quasi-symmetry)         23.7         114

^a Models fitted with Latent Gold (Statistical Innovations, Belmont, MA).


for  subjects  whose  true  rating  is  positive  and  one  for  subjects  whose  true  rating  is  negative. 
This model expresses the $2^7$ joint distribution of the seven ratings as a mixture of two $2^7$
distributions,  one  for  each  true  rating  class. 

Table 14.2 shows results of fitting some latent class models, including another mixture
model to be introduced in Section 14.2.5. Because the observed table is sparse, the deviance
is mainly useful for comparing models. This is an informal comparison, though, because the
chi-squared distribution does not apply for comparing deviances of models with different
numbers of latent classes. A model with $q$ classes is a special case of a model with $q^* > q$
classes in which $P(Z = z) = 0$ for $z > q$ and hence falls on the boundary of the parameter
space. Ordinary chi-squared likelihood-ratio tests require parameters to fall in the interior
of the parameter space [i.e., $0 < P(Z = z) < 1$ for $z = 1, \ldots, q^*$]. The actual large-sample
null distribution is a mixture of chi-squared distributions (Molenberghs and Verbeke 2007).

Table 14.1 also shows the fitted values for latent class models with $q = 1, 2, 3$, for the
cells having positive counts. (Each empty cell also has a fitted value, not shown here.) The
model with $q = 1$ latent class is the model of mutual independence of the seven ratings.
This is equivalent to the loglinear model $(Y_1, Y_2, \ldots, Y_7)$. It fits poorly, as we would expect.
With $q = 2$, considerable evidence remains of lack of fit. For instance, the fitted count for
a negative rating by each pathologist is 23.0, compared with an observed count of 34. (The
small $G^2$ that Table 14.2 reports for this model does not imply a good fit; from Section
3.2.3, $G^2$ tends to be highly conservative when most fitted values are very close to 0.) The
model with $q = 3$ seems to fit adequately.

Studying the estimated probability $\hat P(Y_t = 1 \mid Z = z)$ of a carcinoma diagnosis for each
pathologist, conditional on a given latent class $z$, helps illuminate the nature of these classes.
Table 14.3 reports these for the three-class model. They suggest that (1) the first latent class
refers to cases that all pathologists (except occasionally B) agree show no carcinoma; (2)
the second latent class refers to cases of strong disagreement, whereby C, D, and F rarely
diagnose carcinoma but B, E, and G usually do; and (3) the third latent class refers to
cases in which A, B, E, and G agree show carcinoma and C and D usually agree. The
estimated proportions in the three latent classes are $\hat P(Z = 1) = 0.37$, $\hat P(Z = 2) = 0.18$,
and $\hat P(Z = 3) = 0.45$. The model estimates that 18% of the cases fall in the problematic
disagreement class.
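
To see how these estimates reproduce the fitted values, the sketch below plugs the rounded latent class estimates from Table 14.3 and the estimated class proportions into formula (14.1). The function name `fitted_count` is hypothetical, and because the inputs are rounded the results only approximate the q = 3 column of Table 14.1: the all-negative pattern comes out at roughly 33.5 (Table 14.1 shows 33.8) and the all-positive pattern at roughly 12.5 (Table 14.1 shows 12.3).

```python
import itertools

import numpy as np

# Estimated P(Y_t = 1 | Z = z) for pathologists A-G: latent class rows of Table 14.3.
item_probs = np.array([
    [0.057, 0.138, 0.000, 0.000, 0.055, 0.000, 0.000],
    [0.513, 1.000, 0.000, 0.058, 0.751, 0.000, 0.631],
    [1.000, 0.981, 0.858, 0.586, 1.000, 0.476, 1.000],
])
class_probs = np.array([0.37, 0.18, 0.45])  # estimated P(Z = z), as rounded in the text
n_slides = 118


def fitted_count(pattern):
    """Fitted count n * sum_z P(Z = z) * prod_t P(Y_t = y_t | Z = z), from (14.1)."""
    pattern = np.asarray(pattern)
    per_item = np.where(pattern == 1, item_probs, 1.0 - item_probs)
    return n_slides * float(class_probs @ per_item.prod(axis=1))


all_negative = fitted_count([0] * 7)
all_positive = fitted_count([1] * 7)
# Fitted counts over all 2^7 cells add up to the sample size.
total = sum(fitted_count(p) for p in itertools.product((0, 1), repeat=7))
```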

A  danger  with  latent  variable  models,  shared  by  factor  analysis  for  continuous  responses, 
is  the  temptation  to  interpret  latent  variables  too  literally.  For  example,  here  it  is  tempting 
Table 14.3 Estimated Probabilities of Diagnosing Carcinoma, for Latent Class Model and
Rasch Mixture Model with Three Classes^a

                 Latent                        Pathologist
Model            Class      A       B       C       D       E       F       G
Latent class       1      0.057   0.138   0.000   0.000   0.055   0.000   0.000
                   2      0.513   1.000   0.000   0.058   0.751   0.000   0.631
                   3      1.000   0.981   0.858   0.586   1.000   0.476   1.000
Rasch mixture      1      0.022   0.150   0.001   0.000   0.047   0.000   0.022
                   2      0.611   0.923   0.052   0.015   0.774   0.009   0.611
                   3      0.994   0.999   0.853   0.617   0.997   0.483   0.994

^a Results obtained with Latent Gold (Statistical Innovations, Belmont, MA).


to  treat  latent  class  3  as  cases  truly  having  carcinoma  and  a  rating  of  carcinoma  given  that 
the  subject  falls  in  latent  level  3  as  being  a  correct  judgment.  Realize  the  tentative  nature  of 
the  latent  variable  and  be  careful  not  to  make  the  error  of  reification — treating  an  abstract 
construction  as  if  it  has  actual  existence. 

Using the model parameter estimates and Bayes' theorem, we can also estimate
$P(Z = z \mid Y_t = y_t)$ and $P(Z = z \mid Y_1 = y_1, \ldots, Y_T = y_T)$. If a pathologist makes a "yes"
rating, for instance, what is the estimated probability that the subject is in the latent class
for which agreement on a positive rating usually occurs? We perform further analysis in
Section 14.2.6 after studying a simpler model. We could also use methods of Chapter 13,
such as a GLMM with a normal rather than categorical latent variable. A logistic-normal
random intercept model, for instance, yields subject-specific comparisons of $P(Y_t = 1)$ for
various $t$.
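
The Bayes' theorem calculation can be sketched as follows, again using the rounded latent class estimates from Table 14.3 and the estimated class proportions 0.37, 0.18, and 0.45. The function names are hypothetical and the numbers inherit the rounding of the reported estimates. With these inputs, a "yes" rating by pathologist A gives an estimated probability of about 0.80 that the slide is in latent class 3.

```python
import numpy as np

# Estimated P(Y_t = 1 | Z = z) for pathologists A-G: latent class rows of Table 14.3.
item_probs = np.array([
    [0.057, 0.138, 0.000, 0.000, 0.055, 0.000, 0.000],
    [0.513, 1.000, 0.000, 0.058, 0.751, 0.000, 0.631],
    [1.000, 0.981, 0.858, 0.586, 1.000, 0.476, 1.000],
])
class_probs = np.array([0.37, 0.18, 0.45])  # estimated P(Z = z)


def posterior_given_rating(rater, rating):
    """Estimated P(Z = z | Y_t = rating) for one rater (0 = A, ..., 6 = G)."""
    p_yes = item_probs[:, rater]
    likelihood = p_yes if rating == 1 else 1.0 - p_yes
    joint = class_probs * likelihood
    return joint / joint.sum()


def posterior_given_pattern(pattern):
    """Estimated P(Z = z | Y_1 = y_1, ..., Y_T = y_T) for a full rating pattern."""
    pattern = np.asarray(pattern)
    per_item = np.where(pattern == 1, item_probs, 1.0 - item_probs)
    joint = class_probs * per_item.prod(axis=1)
    return joint / joint.sum()


post_a_yes = posterior_given_rating(0, 1)
post_all_yes = posterior_given_pattern([1] * 7)
```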


14.1.4  Example:  Latent  Class  Models  for  Capture-Recapture 

We next apply latent class models to capture-recapture modeling for estimating
population size. In Section 13.3.4 we used a logistic-normal GLMM for this. With $T$ sampling
occasions, a $2^T$ contingency table displays the data, with scale (captured, not captured) at
each occasion. A prediction of the population size equals the prediction for the missing cell
count, representing subjects not captured at any occasion, added to the counts in other
cells.

With  two  classes,  the  latent  class  model  treats  the  population  as  a  mixture  of  two 
types,  perhaps  determined  by  genetic  or  environmental  factors.  Homogeneity  of  capture 
probabilities  occurs  for  subjects  within  each  type,  but  the  type  of  any  given  subject  is 
unknown.  This  model  represents  a  compromise  between  the  mutual  independence  model, 
which  assumes  a  single  latent  class  and  complete  homogeneity,  and  the  logistic-normal 
GLMM,  which  assumes  a  continuous  mixture  of  capture  probabilities  rather  than  two 
classes. 

We illustrate with the data set on snowshoe hares in Table 13.6, having $T = 6$ captures.
The model of mutual independence predicts that $\hat N = 75$. Its 95% profile likelihood
confidence interval for $N$ is (70, 83). The latent class model with two classes has $\hat N = 85$ and
a profile likelihood interval of (74, 106). The latent class model with three classes gives
similar results. The logistic-normal GLMM in Section 13.3.4 gave the interval (75, 154),
so the latent class intervals seem too short to be trusted. This simple latent class model may
not capture all the existing heterogeneity.

Possible  models  other  than  latent  class  or  parametric  random  effects  models  include 
loglinear  models  (Cormack  1989).  They  are  marginal  models,  applying  to  probabilities 
averaged over subjects. Let $Y_t$ denote the binary capture variable for a randomly selected
subject at occasion t. The simplest model, $(Y_1, Y_2, \ldots, Y_T)$, treats capture events as mutually
independent and is equivalent to the logistic-normal model (13.12) with $\sigma = 0$ and latent
class model (14.1) with $q = 1$. The loglinear model $(Y_1Y_2, Y_1Y_3, \ldots, Y_{T-1}Y_T)$ allows an
association between pairs of capture variables. Alternatively, a simpler model with Markov
structure $(Y_1Y_2, Y_2Y_3, \ldots, Y_{T-1}Y_T)$ or with the same association for each pair of occasions
may  be  useful  (Exercise  14.3). 

In capture-recapture experiments, $\hat{N}$ and confidence intervals for N depend strongly on
the choice of model. Standard goodness-of-fit criteria are of limited help. Two models can
fit the observed counts well, yet yield quite different predictions for the unobserved count.
For instance, for the snowshoe hare data, the loglinear models of mutual independence and
of two-factor association both fit relatively well ($G^2 = 58.3$, df = 56 for mutual independence
and $G^2 = 32.4$, df = 41 for the two-factor model); however, their $\hat{N}$ values are 75
and 105.

Simpler  models  usually  give  narrower  confidence  intervals  for  N,  through  the  usual 
benefits  of  model  parsimony.  This  is  not  necessarily  good  for  this  type  of  application.  A 
narrow  confidence  interval  for  N  is  desirable,  but  not  at  the  expense  of  severe  sacrifice 
in  the  actual  confidence  level.  Intervals  based  on  a  possibly  unrealistic  assumption  of 
subject  homogeneity  are  often  overly  optimistic.  Simulations  suggest  that  actual  coverage 
probabilities  can  then  be  well  below  nominal  levels  when  even  slight  model  misspecification 
occurs.  Allowance  for  heterogeneity  among  subjects  results  in  wider  intervals.  Severe 
population  heterogeneity  makes  reaching  useful  conclusions  difficult,  as  intervals  can  be 
very  wide  (Burnham  and  Overton  1978,  Coull  and  Agresti  1999). 

14.1.5  Example:  Latent  Class  Transitional  Models 

The  basic  latent  class  model  has  been  generalized  in  many  ways.  For  example,  Reboussin 
and  Ialongo  (2010)  modeled  drug  use  among  high  school  students  who  suffer  from  attention 
deficit  hyperactivity  disorder  (ADHD).  Their  model  consists  of  two  separate  latent  class 
models:  A  longitudinal  latent  transition  model  has  latent  classes  that  are  stages  of  marijuana 
use  and  describes  the  probability  of  transitioning  between  the  stages.  A  cross-sectional  latent 
class  predictor  model  empirically  constructs  ADHD  subtypes  and  describes  the  influence 
of  those  subtypes  on  the  transition  rates. 

Other generalizations that focus on transitions use continuous latent variables and resemble
multivariate random effects models. For example, in a longitudinal aging study,
Lin  et  al.  (2008)  modeled  repeated  transitions  between  independence  and  disability  states 
of  activities  of  daily  living.  Their  multistate  transition  model  is  designed  for  the  analysis 
of  repeated  episodes  of  multiple  states  representing  different  health  status,  where  some 
states  (such  as  death)  are  absorbing.  Transitions  among  multiple  states  are  modeled  jointly 
using  multivariate  latent  variables.  A  state-specific  latent  variable  represents  an  individual’s 
tendency  to  remain  in  a  nonabsorbing  state,  beyond  the  time  explained  by  covariates,  and 
to  account  for  correlation  among  repeated  sojourns  in  the  same  state.  Correlation  among 
sojourns  across  different  states  is  accounted  for  by  the  correlation  between  the  different 
latent variables.

14.2 NONPARAMETRIC RANDOM EFFECTS MODELS

In spite of its popularity and attractive features, the normality assumption for random effects
in ordinary GLMMs can rarely be closely checked. McCulloch and Neuhaus (2011) noted
that  distributions  of  predicted  values  are  highly  dependent  on  their  assumed  distribution 
and  are  not  reliable  indicators  of  the  true  random  effects  distribution.  An  obvious  concern 
of  this  or  any  parametric  assumption  for  the  random  effects  is  possibly  harmful  effects 
of  misspecification.  To  check  sensitivity  to  this  assumption,  we  can  fit  GLMMs  using 
alternative  or  more  general  random  effects  assumptions. 


14.2.1  Logistic  Models  with  Unspecified  Random  Effects  Distribution 

A nonparametric approach (Aitkin 1999, Heckman and Singer 1984) guards against possibly
harmful misspecification effects. This uses an unspecified random effects distribution
on  a  finite  set  of  mass  points.  The  location  of  the  mass  points  and  their  probabilities  are 
parameters.  The  number  of  mass  points  can  be  fixed.  When  this  number  is  itself  unknown, 
we  treat  it  as  fixed  in  the  estimation  process  but  increase  it  sequentially  until  the  likelihood 
is  maximized.  The  maximization  usually  requires  relatively  few  mass  points.  Even  allowing 
a  continuous  mixture  distribution,  the  nonparametric  estimate  of  that  distribution  takes  a 
finite  number  of  points  (e.g.,  Lindsay  et  al.  1991).  In  fact,  fitting  a  model  having  only 
two  mass  points  often  results  in  fixed  effects  estimates  quite  similar  to  those  with  the  full 
maximization.  This  approach  is  useful  primarily  when  the  random  effects  distribution  is 
not  itself  of  direct  interest,  since  the  nonparametric  estimate  of  that  distribution  tends  to  be 
poor  even  for  very  large  samples. 

Model  fitting  is  actually  simpler  than  for  models  with  normal  random  effects,  since 
the  integral  that  determines  the  likelihood  function  simplifies  to  a  finite  sum.  However, 
this  approach  also  has  disadvantages.  For  instance,  with  multivariate  random  effects  it 
cannot  provide  simple  correlation  structure  as  the  normal  can.  Also,  the  ML  estimate  of  the 
random effects distribution often places some weight at $\pm\infty$. Although this can be useful
with  binary  data  for  identifying  a  subsample  for  which  the  estimated  response  probability 
equals  1  or  equals  0  for  all  observations  in  a  cluster,  it  is  not  then  possible  to  describe 
heterogeneity  with  an  estimated  variance  component. 
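
The simplification of the likelihood to a finite sum can be sketched as follows for a logistic random intercept model. For one cluster, the marginal probability of its binary responses is a weighted sum over the mass points of products of Bernoulli probabilities. The function names, linear predictor values, mass points, and probabilities below are hypothetical and serve only to illustrate the computation.

```python
from itertools import product
from math import exp

def expit(z):
    """Numerically stable inverse logit."""
    return 1 / (1 + exp(-z)) if z >= 0 else exp(z) / (1 + exp(z))

def cluster_likelihood(y, eta, mass_points, probs):
    """Marginal probability of the binary responses y in one cluster.

    eta[t] is the fixed-effect part of the linear predictor for observation t.
    The random intercept equals mass_points[k] with probability probs[k].
    """
    total = 0.0
    for a_k, rho_k in zip(mass_points, probs):
        term = rho_k
        for y_t, eta_t in zip(y, eta):
            p = expit(eta_t + a_k)
            term *= p if y_t == 1 else 1 - p
        total += term            # finite sum replaces the integral
    return total

eta = [0.5, -0.2, 1.0]           # hypothetical fixed-effect predictors
mass_points = [-1.0, 1.5]        # hypothetical mass point locations
probs = [0.6, 0.4]               # hypothetical mass point probabilities

# The probabilities of all 2^3 response patterns sum to 1.
total = sum(cluster_likelihood(y, eta, mass_points, probs)
            for y in product([0, 1], repeat=3))

# A single mass point at 0 reduces to independent Bernoulli responses.
one_point = cluster_likelihood([1, 0, 1], eta, [0.0], [1.0])
```

The full log likelihood sums the logarithm of this quantity over clusters; the locations and probabilities of the mass points are then maximized along with the fixed effects.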


14.2.2  Example:  Attitudes  About  Legalized  Abortion 

To  illustrate  this  approach,  we  reanalyze  Table  13.3  on  attitudes  about  legalized  abortion. 
In  Section  13.3.2  we  fitted  the  logistic-normal  model. 


$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + \gamma x_i + u_i \tag{14.3}$$


with $x_i$ = gender (1 = female) and parameters $\{\beta_t\}$ representing three conditions under
which abortion might be legal.

Treating $u_i$ instead nonparametrically, the likelihood maximizes with a two-point
mixture distribution. Estimated abortion item effects are $\hat\beta_1 - \hat\beta_3 = 0.83$ (SE = 0.16),
$\hat\beta_2 - \hat\beta_3 = 0.30$ (SE = 0.16), and $\hat\beta_1 - \hat\beta_2 = 0.52$ (SE = 0.16). Results are similar to those
in Table 13.3 for the normal random effects approach.

Table 14.4 Number of Protozoa Exposed to Poison Dose and Number That Died

Poison Dose   Exposed   Dead      Poison Dose   Exposed   Dead
4.7           55        0         5.1           53        22
4.8           49        8         5.2           53        37
4.9           60        18        5.3           51        47
5.0           55        18        5.4           50        50

Source: Follman and Lambert (1989). Reprinted with permission from the
Journal of the American Statistical Association.


14.2.3  Example:  Nonparametric  Mixing  of  Logistic  Regressions 

Follman  and  Lambert  (1989)  analyzed  the  effect  of  the  dosage  of  a  poison  on  the  probability 
of  death  of  a  protozoan  of  a  particular  genus.  Table  14.4  shows  the  data.  They  assumed  two 
unobserved  types  of  that  genus. 

Let $\pi_i(x)$ denote the probability of death at log dose level x for genus type i, i = 1, 2.
Let $\rho$ denote the probability a protozoan belongs to genus type 1. Their model specifies

$$\pi(x) = \rho\,\pi_1(x) + (1 - \rho)\,\pi_2(x), \quad \text{where } \operatorname{logit}[\pi_i(x)] = \alpha_i + \beta x,$$

with unknown $\rho$. The curve for $\pi(x)$ is a weighted average of two curves having the same
logistic shapes but different intercepts.

The ordinary logistic regression model is the special case $\rho = 1$. Its fit,
$\operatorname{logit}[\hat\pi(x)] = -68.4 + 42.1x$, is poor, with deviance $G^2 = 24.7$ (df = 6). The fit of the mixture model is

$$\hat\pi(x) = 0.34\,\hat\pi_1(x) + 0.66\,\hat\pi_2(x), \quad \text{with}$$

$$\operatorname{logit}[\hat\pi_1(x)] = -196.2 + 124.8x, \qquad \operatorname{logit}[\hat\pi_2(x)] = -205.7 + 124.8x.$$

Figure 14.2 shows the fit. This is much better, with $G^2 = 3.4$ (df = 4); that is, double
the  maximized  log-likelihood  increases  by  21.3  by  adding  two  parameters:  an  additional 
intercept  and  the  probability  for  the  mixture.  Follman  and  Lambert  noted  that  with  eight 
dose  levels,  at  most  two  mixture  points  are  identifiable  for  this  model. 
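
The comparison of the two fits can be checked approximately from the reported coefficients. The sketch below evaluates the fitted probabilities at each dose in Table 14.4 and computes the deviance of each fit. It assumes that x is the natural logarithm of the tabled dose, which is consistent with the reported intercepts and slopes but is not stated explicitly. Because the coefficients are rounded, the deviances should only be expected to land near the reported 24.7 and 3.4.

```python
from math import exp, log

doses = [4.7, 4.8, 4.9, 5.0, 5.1, 5.2, 5.3, 5.4]
exposed = [55, 49, 60, 55, 53, 53, 51, 50]
dead = [0, 8, 18, 18, 22, 37, 47, 50]

def expit(z):
    """Numerically stable inverse logit."""
    return 1 / (1 + exp(-z)) if z >= 0 else exp(z) / (1 + exp(z))

def single_logistic(x):
    # Ordinary logistic regression fit reported in the text.
    return expit(-68.4 + 42.1 * x)

def two_point_mixture(x):
    # Mixture fit reported in the text: common slope, two intercepts.
    return 0.34 * expit(-196.2 + 124.8 * x) + 0.66 * expit(-205.7 + 124.8 * x)

def deviance(prob_fn):
    """Binomial deviance G^2 of the fitted probabilities against Table 14.4."""
    g2 = 0.0
    for dose, n, y in zip(doses, exposed, dead):
        mu = n * prob_fn(log(dose))     # assumption: x = natural log of dose
        if y > 0:
            g2 += 2 * y * log(y / mu)
        if y < n:
            g2 += 2 * (n - y) * log((n - y) / (n - mu))
    return g2

g2_single = deviance(single_logistic)
g2_mixture = deviance(two_point_mixture)
```

The large drop in deviance comes mainly from the lowest and highest doses, where a single logistic curve cannot reach the observed proportions of 0 and 1 without distorting the fit in the middle of the dose range.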

The ordinary GLMM assumes a normal mixture of logistic curves. It gives a deviance
reduction of only 1.7 compared to the ordinary logistic model with $\rho = 1$.

14.2.4 Is Misspecification of Random Effects a Serious Problem?

Is  it  worth  the  trouble  to  consider  alternatives  to  the  normality  assumption  for  random 
effects  in  GLMMs,  whether  they  be  parametric  or  nonparametric?  For  logistic  random 
intercept  models,  different  assumptions  for  the  random  effects  distribution  often  provide 
similar  results  for  estimating  the  regression  effects.  Choosing  an  incorrect  random  effects 
distribution  does  not  tend  to  bias  estimators  of  those  effects.  The  true  distribution  for  the 
random  effects  being  skewed  can  result  in  some  bias  for  the  normal  intercept  estimator 
(Neuhaus et al. 1992). The choice of random effects distribution also usually has little
impact on efficiency of estimation. Also, using a nonparametric approach when the true
distribution is normal does not result in much efficiency loss (Neuhaus and Lesperance
1996).

Figure 14.2 Fit of binary mixture of logistic regressions to Table 14.4 [model fitted using Latent Gold (Statistical
Innovations, Belmont, MA)].

When  the  true  random  effects  distribution  is  far  from  normal,  there  can  be  some  efficiency 
loss  for  the  logistic-normal  estimator.  One  such  case  is  when  the  true  distribution  is  a  two- 
point  mixture  with  large  variance  component,  such  as  suggested  in  the  previous  example. 
Agresti  et  al.  (2004)  studied  this  with  various  models,  such  as  a  simple  one-way  random 
effects model. In cluster i, let $y_{it}$ be a Bernoulli variate satisfying

$$\operatorname{logit}[P(y_{it} = 1 \mid u_i)] = \alpha + u_i, \quad i = 1, \ldots, n, \quad t = 1, \ldots, T, \tag{14.4}$$

where $\operatorname{var}(u_i) = \sigma^2$. Simulated samples from this model used various n, T, $\alpha$, and $\sigma$,
and various true distributions for $u_i$, including normal, uniform, exponential, and binary.
When the true distribution is a two-point mixture, the normal approach loses efficiency
in estimating $\{\pi_i = P(Y_{it} = 1 \mid u_i)\}$, more so as $\sigma$ and T increase. For example, when
n = T = 30, $\alpha = 0$, and the mixture has probability 0.50 at each point, the expected value
of $|\hat\pi_i - \pi_i|$ is (0.061, 0.023) for the (normal, nonparametric) approach when $\sigma = 1.0$, and
(0.045, 0.013) when $\sigma = 2.0$.

The example from Follman and Lambert (1989) discussed in Section 14.2.3, which has a
covariate but T = 1, illustrates the potential efficiency loss with the logistic-normal GLMM.
The two-point mixture model has $\hat\beta = 124.8$ with SE = 25.2, for which $\hat\beta/\mathrm{SE} = 4.9$. The
normal mixture model has $\hat\beta = 65.5$ with SE = 19.5, for which $\hat\beta/\mathrm{SE} = 3.4$.

Some  research  suggests  that  the  random  effects  distribution  has  to  be  highly  nonnormal 
for  the  normal  GLMM  to  suffer  in  bias  or  efficiency.  McCulloch  and  Neuhaus  (2011) 
noted  that  the  accuracy  of  predicted  random  effects  is  not  much  affected  by  mild-to- 
severe  violations  of  the  assumed  structure.  Assuming  different  distributions  for  the  random 
effects  can  yield  quite  different  predicted  values  yet  have  similar  performance  in  terms  of 
overall  accuracy  of  prediction;  in  fact,  they  noted  that  a  significantly  better  fitting  random 
effects  distribution  may  not  perform  better  for  prediction.  However,  Heagerty  and  Zeger 
(2000) noted that other types of misspecification can be more crucial. Regarding bias, they
argued that sensitivity to the random effects assumption is greater for estimating regression
parameters  in  random  effects  models  than  estimating  their  counterparts  in  corresponding 
marginal  models.  They  illustrated  this  with  a  model  violation  by  which  the  variance  of 
the  random  effects  depends  on  values  of  covariates.  They  concluded  that  between-cluster 
effects  may  be  more  sensitive  to  correct  specification  of  the  random  effects  distribution  than 
within-cluster  effects.  This  is  an  advantage  of  using  marginal  models  for  between-cluster 
effects. 

14.2.5  Rasch  Mixture  Model 

From Section 13.1.4, for subject i with item t the Rasch model for a binary response is

$$\operatorname{logit}[P(Y_{it} = 1 \mid u_i)] = \alpha + \beta_t + u_i, \quad t = 1, \ldots, T, \tag{14.5}$$

with a constraint on $\{\beta_t\}$. The GLMM treats $\{u_i\}$ as normal random effects. Lindsay et al.
(1991) studied this model when $u_i$ instead can assume only a finite number q of values.
Denote its distribution by

$$P(U = a_k) = \rho_k, \quad k = 1, \ldots, q,$$

for unknown $\{a_k\}$ and $\{\rho_k\}$ satisfying a constraint for identifiability, such as $\sum_k \rho_k a_k = 0$.
This model is called a Rasch mixture model. As in other random effects models, $u_i$ is
unobserved, and the T responses are assumed conditionally independent at each fixed $u_i$
value. It differs from the ordinary latent class model for binary responses having q latent
classes (Section 14.1), since it assumes structure (14.5) for $P(Y_{it} = 1 \mid u_i)$ whereas latent
class model (14.1) assumes no structure for $P(Y_t = y_t \mid Z = z)$.

For the Rasch mixture model, the marginal probability of a sequence of responses
$(y_1, \ldots, y_T)$ is

$$\pi_{y_1, \ldots, y_T} = \sum_{k=1}^{q} \rho_k \prod_{t=1}^{T} \frac{\exp[y_t(\beta_t + a_k)]}{1 + \exp(\beta_t + a_k)}.$$

Substituting this in the multinomial log likelihood (14.2), we can estimate $\{a_k, \rho_k\}$ and
$\{\beta_t\}$ using Newton-Raphson or EM algorithms. As q increases, the maximized likelihood
increases and the fit improves. However, Lindsay et al. (1991) showed that, with T items,
the likelihood no longer changes once $q = (T + 1)/2$. Then, the model gives the same fit to
the $2^T$ observed table as the quasi-symmetry model (11.32). Thus, this simpler latent class
model has a symmetric conditional association structure among the observed variables.


14.2.6  Example:  Modeling  Rater  Agreement  Revisited 

For the ratings of carcinoma by seven pathologists (Table 14.1), Table 14.2 also summarizes
the fit of Rasch mixture models. Here, $P(Y_{it} = 1 \mid u_i)$ in (14.5) denotes the probability of
a carcinoma diagnosis for pathologist t evaluating slide i. With q = 3, it does not fit
significantly more poorly than the latent class model. With T = 7 raters, the discrete
mixture can take at most $(T + 1)/2 = 4$ points. The model with q = 4 is equivalently the
quasi-symmetry model. It does not seem to fit better than with q = 3.

Pathologist    F       D       C       A      G      E      B
Estimate      -3.70   -3.15   -1.87   1.48   1.48   2.26   3.52

Figure 14.3 Pathologist estimates for Rasch mixture model and results of 90% Bonferroni simultaneous comparison.


Figure 14.3 shows $\{\hat\beta_t\}$ for the Rasch mixture model with q = 3, setting $\sum_t \hat\beta_t = 0$.
These describe variation among the pathologists' response distributions at each latent
level. For a given latent class, for instance, the estimated odds of a carcinoma diagnosis
for pathologist B are exp(3.52 - 1.48) = 7.7 times the estimated odds for pathologist A.
Pathologist B tends to make a carcinoma diagnosis most often, and D and F the least. The
figure also shows results of a 90% Bonferroni comparison of the 21 pairs of pathologists,
based on Wald intervals for all pairwise differences $\hat\beta_t - \hat\beta_s$.

For pathologist t, conditional on latent level k for a slide,

$$\frac{\exp(\hat\beta_t + \hat a_k)}{1 + \exp(\hat\beta_t + \hat a_k)}$$

estimates the probability of a carcinoma diagnosis. Table 14.3 reports these, which use
$\hat a_1 = -5.25$, $\hat a_2 = -1.02$, and $\hat a_3 = 3.63$. They are similar to the estimates for the ordinary
latent class model with q = 3 but a bit smoother, with fewer estimates at the boundary.
Again, at latent level 1 pathologists tend not to diagnose carcinoma, at level 2 many
disagreements occur, and at level 3 pathologists tend to diagnose carcinoma. The estimated
latent class proportions are $\hat\rho_1 = 0.37$, $\hat\rho_2 = 0.19$, and $\hat\rho_3 = 0.43$, similar to the ordinary
latent class model.
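
The conditional probabilities just described follow directly from the reported estimates. The sketch below plugs the pathologist estimates from Figure 14.3 and the three latent-level estimates into the inverse logit, and also reproduces the odds ratio comparing pathologists B and A. The variable names are ours, and because the inputs are rounded, the results need not match Table 14.3 in every digit.

```python
from math import exp

# Estimates reported in Figure 14.3 (pathologist effects) and in the text
# (latent-level locations) for the Rasch mixture model with q = 3.
beta_hat = {"F": -3.70, "D": -3.15, "C": -1.87, "A": 1.48,
            "G": 1.48, "E": 2.26, "B": 3.52}
a_hat = [-5.25, -1.02, 3.63]

def expit(z):
    """Numerically stable inverse logit."""
    return 1 / (1 + exp(-z)) if z >= 0 else exp(z) / (1 + exp(z))

# Estimated probability of a carcinoma diagnosis, by pathologist and latent level.
cond_prob = {rater: [expit(b + a) for a in a_hat]
             for rater, b in beta_hat.items()}

# Within any latent level, odds for B relative to odds for A.
odds_ratio_b_vs_a = exp(beta_hat["B"] - beta_hat["A"])

# Odds at latent level 3 relative to level 1, for any pathologist.
odds_ratio_level3_vs_1 = exp(a_hat[2] - a_hat[0])

for rater, probs in cond_prob.items():
    print(rater, [round(p, 3) for p in probs])
```

Each row of the printed output increases across the three latent levels, and the rows for D and F stay below 0.5 even at level 3, which matches the description of those two pathologists as the least likely to diagnose carcinoma.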

Model (14.5) implies that the association between each $Y_t$ and U has log odds ratio
$(a_k - a_l)$ for levels k and l of U. For instance, in the third latent class the estimated odds
that a pathologist diagnoses carcinoma are exp[3.63 - (-5.25)] > 7000 times those in the
first latent class. The large $\{\hat a_k - \hat a_l\}$ suggest strong association between each pathologist's
rating and the latent variable. This induces strong association between pairs of pathologist
ratings. The model-fitted odds ratios between pairs of raters vary between about 7 and 400,
but confidence intervals reveal that these estimates are very imprecise. However, the quite
varied $\{\hat\beta_t\}$ suggest that substantial marginal heterogeneity exists among the seven ratings.
This causes heterogeneity in pairwise levels of agreement.

The mutual independence model is the special case of the Rasch mixture model with
q = 1; that is, $\rho_1 = 1$. For Table 14.1 the Rasch mixture model with q = 3 has only four
more parameters than the mutual independence model (i.e., $\rho_k$ and $a_k$, k = 1, 2). Yet it fits
well and has simple interpretations.


14.2.7  Nonparametric  Mixtures  and  Quasi-symmetry 

A distribution-free approach for $u_i$ with the Rasch form of model (14.5) implies the
quasi-symmetry loglinear model marginally (Darroch 1981, Tjur 1982). Let $Y_i$ denote the
sequence of T responses for subject i. For the possible outcomes $y = (y_1, \ldots, y_T)$, where
each $y_t = 1$ or 0, and removing $\alpha$ and the constraint on $\{\beta_t\}$,

$$P(Y_i = y \mid u_i) = \prod_t \left[\frac{\exp(\beta_t + u_i)}{1 + \exp(\beta_t + u_i)}\right]^{y_t} \left[\frac{1}{1 + \exp(\beta_t + u_i)}\right]^{1 - y_t} = \frac{\exp\left[u_i\left(\sum_t y_t\right) + \sum_t y_t \beta_t\right]}{\prod_t [1 + \exp(\beta_t + u_i)]}.$$

Let F denote the cdf of $u_i$. The marginal probability of sequence y for a randomly selected
subject is (suppressing the subject label)

$$\pi_{y_1, \ldots, y_T} = E_U\, P(Y = y \mid U) = \exp\left(\sum_t y_t \beta_t\right) \int \frac{\exp\left[u\left(\sum_t y_t\right)\right]}{\prod_t [1 + \exp(\beta_t + u)]}\, dF(u).$$


This probability contributes to the log likelihood, which is (14.2) for a multinomial
distribution over the $2^T$ cells for possible y. Regardless of the choice for F, the integral
is complex. However, it depends on the data only through $\sum_t y_t$. A more general model
replaces this integral by a separate parameter for each value of $\sum_t y_t$. This model has form

$$\log \pi_{y_1, \ldots, y_T} = \sum_t \beta_t y_t + \lambda_{y_1 + \cdots + y_T}. \tag{14.6}$$

The final term represents a separate parameter at each value of $\sum_t y_t$.

The implied marginal model (14.6) has interaction term that is invariant to any permutation
of the response outcomes y, since each such permutation yields the same sum, $\sum_t y_t$.
Thus, it is the loglinear model of quasi-symmetry (11.32). No matter what form F takes,
the marginal model has the same main-effect structure, and it has an interaction term that
is a special case of the one in (14.6). Thus, we can consistently estimate $\{\beta_t\}$ using the
ordinary ML estimates for the quasi-symmetry model. In fact, Tjur (1982) showed that
these estimates are also the conditional ML estimates, treating $\{u_i\}$ as fixed effects and
conditioning on their sufficient statistics. The interaction parameters in model (14.6) result
from the dependence in responses among variables, due to heterogeneity in $\{u_i\}$.
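
The quasi-symmetry structure can be verified numerically for a discrete mixing distribution. In the sketch below, the item parameters and the three-point mixing distribution are hypothetical. For any two response patterns with the same total $\sum_t y_t$, the mixing distribution cancels from the ratio of their marginal probabilities, which then equals the exponential of the difference in $\sum_t \beta_t y_t$.

```python
from itertools import product
from math import exp

def rasch_mixture_probs(beta, mass_points, probs):
    """Marginal probability of every response pattern under a Rasch model
    whose random effect takes value mass_points[k] with probability probs[k]."""
    table = {}
    for y in product([0, 1], repeat=len(beta)):
        total = 0.0
        for u, rho in zip(mass_points, probs):
            term = rho
            for y_t, b_t in zip(y, beta):
                term *= exp(y_t * (b_t + u)) / (1 + exp(b_t + u))
            total += term
        table[y] = total
    return table

beta = [0.8, 0.3, -0.4]                      # hypothetical item parameters
pi = rasch_mixture_probs(beta,
                         mass_points=[-1.5, 0.2, 2.0],   # hypothetical
                         probs=[0.3, 0.5, 0.2])          # hypothetical

# Patterns with one positive response: the ratio depends only on beta.
ratio_one = pi[(1, 0, 0)] / pi[(0, 1, 0)]    # equals exp(beta[0] - beta[1])

# Patterns with two positive responses: same cancellation.
ratio_two = pi[(1, 1, 0)] / pi[(0, 1, 1)]    # equals exp(beta[0] - beta[2])

total_prob = sum(pi.values())
```

Changing the mass points or their probabilities alters the individual cell probabilities but leaves both ratios unchanged, which is why the quasi-symmetry fit estimates the item differences without any assumption about F.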


14.2.8  Example:  Attitudes  About  Legalized  Abortion  Revisited 

We illustrate for the opinions about legalized abortion analyzed with a GLMM in Section
13.3.2 and with a nonparametric random effects approach in Section 14.2.2. For model
(14.3), estimated within-subject comparisons $\hat\beta_t - \hat\beta_s$ of items result from fitting a
quasi-symmetric loglinear model. Let $\mu_g(y_1, y_2, y_3)$ denote the expected frequency for gender g
making response $y_t$ to item t, t = 1, 2, 3, where for item t, $y_t = 1$ for approval and 0 for
disapproval. The loglinear model is

$$\log \mu_g(y_1, y_2, y_3) = \beta_1 y_1 + \beta_2 y_2 + \beta_3 y_3 + \beta_4 g + \lambda_{y_1 + y_2 + y_3}. \tag{14.7}$$

For $y_1 + y_2 + y_3 = k$, $\lambda_k$ refers to all cells in which subjects voiced approval for k
of the three items, k = 0, 1, 2, 3. The ML fit, which has $G^2 = 10.2$ with df = 9, yields
$\hat\beta_1 - \hat\beta_2 = 0.521$ (SE = 0.154), $\hat\beta_1 - \hat\beta_3 = 0.828$ (SE = 0.160), and $\hat\beta_2 - \hat\beta_3 = 0.307$
(SE = 0.161). These are similar to the GLMM estimates (Table 13.3) and nonparametric
random effects model estimates in Section 14.2.2. They also are the conditional ML
estimates for model (14.3), treating $\{u_i\}$ as fixed. With this approach or conditional ML,
however, we cannot estimate between-groups effects, such as the gender effect in model
(14.3). [The $\beta_4$ parameter in model (14.7) refers to relative sample sizes of males and
females and is not the same as the $\gamma$ gender effect in (14.3).]


14.3  BETA-BINOMIAL  MODELS 

The  beta-binomial  model  is  a  parametric  mixture  model  that  is  another  alternative  to  binary 
GLMMs  with  normal  random  effects.  As  with  other  mixture  models  that  assume  a  binomial 
distribution  at  a  fixed  parameter  value,  the  marginal  distribution  permits  more  variation  than 
the  binomial.  Thus,  a  model  using  the  beta-binomial  can  handle  overdispersion  occurring 
with  ordinary  binomial  models. 


14.3.1  Beta-Binomial  Distribution 

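The beta-binomial distribution results from a beta distribution mixture of binomials.
Suppose that (a) given $\pi$, Y has a binomial distribution, bin(n, $\pi$), and (b) $\pi$ has a beta
distribution. The beta pdf (Sec. 1.6.2) is

$$f(\pi; \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, \pi^{\alpha_1 - 1}(1 - \pi)^{\alpha_2 - 1}, \quad 0 < \pi < 1, \tag{14.8}$$

with parameters $\alpha_1 > 0$ and $\alpha_2 > 0$, for the gamma function $\Gamma(\cdot)$. Let

$$\mu = \frac{\alpha_1}{\alpha_1 + \alpha_2}, \qquad \theta = 1/(\alpha_1 + \alpha_2).$$

The beta distribution for $\pi$ has mean and variance

$$E(\pi) = \mu, \qquad \operatorname{var}(\pi) = \mu(1 - \mu)\theta/(1 + \theta).$$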

Marginally, averaging with respect to the beta distribution for $\pi$, Y has the beta-binomial
distribution. Its mass function is

$$p(y; \alpha_1, \alpha_2) = \binom{n}{y} \frac{B(\alpha_1 + y,\, n + \alpha_2 - y)}{B(\alpha_1, \alpha_2)}, \quad y = 0, 1, \ldots, n,$$

for the beta function $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$. In terms of $\mu$ and $\theta$, the beta-binomial
mass function is

$$p(y; \mu, \theta) = \binom{n}{y} \frac{\left[\prod_{k=0}^{y-1}(\mu + k\theta)\right]\left[\prod_{k=0}^{n-y-1}(1 - \mu + k\theta)\right]}{\prod_{k=0}^{n-1}(1 + k\theta)}. \tag{14.9}$$

It is easier to understand the nature of this distribution from its moments than from its mass
function. The first two moments are

$$E(Y) = n\mu, \qquad \operatorname{var}(Y) = n\mu(1 - \mu)[1 + (n - 1)\theta/(1 + \theta)].$$

In fact, $\theta/(1 + \theta) = 1/(\alpha_1 + \alpha_2 + 1)$ is the correlation between each pair of the individual
Bernoulli random variables that sum to Y.

As $\theta \to 0$ in the beta distribution, $\operatorname{var}(\pi) \to 0$ and that distribution converges to a
degenerate distribution at $\mu$. Then $\operatorname{var}(Y) \to n\mu(1 - \mu)$ and the beta-binomial distribution
converges to the bin(n, $\mu$).
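
The two forms of the mass function and the moment formulas can be checked numerically. The sketch below evaluates the beta-binomial mass function through the beta function and through the product form in terms of $\mu$ and $\theta$, then computes the mean and variance by direct summation. The parameter values n = 10, $\mu = 0.3$, and $\theta = 0.25$ are arbitrary choices for illustration, and the function names are ours.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, n, mu, theta):
    """Mass function via the beta function, with alpha1 = mu/theta
    and alpha2 = (1 - mu)/theta."""
    a1, a2 = mu / theta, (1 - mu) / theta
    return comb(n, y) * exp(log_beta(a1 + y, a2 + n - y) - log_beta(a1, a2))

def beta_binomial_pmf_product(y, n, mu, theta):
    """Mass function via the product form in terms of mu and theta."""
    num = 1.0
    for k in range(y):
        num *= mu + k * theta
    for k in range(n - y):
        num *= 1 - mu + k * theta
    den = 1.0
    for k in range(n):
        den *= 1 + k * theta
    return comb(n, y) * num / den

n, mu, theta = 10, 0.3, 0.25          # arbitrary illustrative values
pmf = [beta_binomial_pmf(y, n, mu, theta) for y in range(n + 1)]
pmf_product = [beta_binomial_pmf_product(y, n, mu, theta) for y in range(n + 1)]

mean = sum(y * p for y, p in enumerate(pmf))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

# Closed forms: n*mu and n*mu*(1 - mu)*[1 + (n - 1)*theta/(1 + theta)].
mean_formula = n * mu
variance_formula = n * mu * (1 - mu) * (1 + (n - 1) * theta / (1 + theta))
```

With these values the variance is 5.88, compared with 2.1 for the binomial with the same n and $\mu$, so the pairwise correlation $\theta/(1 + \theta) = 0.2$ nearly triples the variance.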

14.3.2  Models  Using  the  Beta-Binomial  Distribution 

Models using the beta-binomial distribution permit $\mu$, and hence E(Y), to depend on
explanatory variables. The simplest models let $\theta$ be the same unknown constant for all
observations. More general models let $\theta$ depend on covariates, such as by allowing a
different $\theta$ for each group of interest (Prentice 1986). Models can use any of the usual link
functions for binary data, but the logit is most common. For observation i with $n_i$ trials,
assuming that $y_i$ has a beta-binomial distribution with index $n_i$ and parameters $(\mu_i, \theta)$, the
model links $\mu_i$ to explanatory variables by

$$\operatorname{logit}(\mu_i) = \alpha + \beta^T x_i.$$

The beta-binomial is not in the natural exponential family, even for known $\theta$. Articles using
beta-binomial models have employed a variety of fitting methods (Note 14.4), including
Newton-Raphson.

14.3.3  Quasi-likelihood  with  Beta-Binomial  Type  Variance 

A related but simpler approach for overdispersed binary counts uses quasi-likelihood with
a variance function similar to that of the beta-binomial. The quasi-likelihood variance function is

$$v(\mu_i) = n_i\mu_i(1 - \mu_i)[1 + (n_i - 1)\rho] \tag{14.10}$$

with $|\rho| < 1$. Although motivated by the beta-binomial model with its correlation between
binary components, this variance function results merely from assuming that $\pi_i$ has a
distribution with $\operatorname{var}(\pi_i) = \rho\mu_i(1 - \mu_i)$.

This variance function also results from assuming a common correlation $\rho$ between each
pair of the $n_i$ individual Bernoulli random variables that sum to $y_i$, without specifically
assuming a beta mixture (Altham 1978). Suppose that $\pi_i = P(Y_{it} = 1) = 1 - P(Y_{it} = 0)$, for
$t = 1, \ldots, n_i$, and $\operatorname{corr}(Y_{is}, Y_{it}) = \rho$ for $s \ne t$. Then, $\operatorname{var}(Y_{it}) = \pi_i(1 - \pi_i)$,
$\operatorname{cov}(Y_{is}, Y_{it}) = \rho\pi_i(1 - \pi_i)$, and

$$\operatorname{var}\left(\sum_t Y_{it}\right) = \sum_t \operatorname{var}(Y_{it}) + 2\sum\sum_{s<t} \operatorname{cov}(Y_{is}, Y_{it})$$

$$= n_i\pi_i(1 - \pi_i) + n_i(n_i - 1)\rho\pi_i(1 - \pi_i) = n_i\pi_i(1 - \pi_i)[1 + \rho(n_i - 1)].$$

The ordinary binomial variance results when $\rho = 0$. Overdispersion occurs when $\rho > 0$.

For this quasi-likelihood approach, Williams (1982) proposed an iterative routine for
estimating $\beta$ and the overdispersion parameter $\rho$. He let $\hat\rho$ be such that the resulting Pearson
$X^2$ that sums the squared Pearson residuals for this variance function equals the residual df
for the model. This requires an iterative two-step process of (1) solving the quasi-likelihood
equations for $\beta$ for a given $\hat\rho$, and then (2) using the updated $\hat\beta$, solving for $\hat\rho$ in the equation
that equates $X^2$ (which depends on $\hat\beta$ and $\hat\rho$) to its df.
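
The two-step cycle can be sketched in a deliberately simplified setting: a model with a single common probability $\mu$ for all clusters, so that step (1) has a closed form (a weighted proportion with weights $1/[1 + (n_i - 1)\rho]$). This is an illustration of the alternation described above, not Williams's routine itself. The data, the function names, and the use of bisection in step (2) are all assumptions made for the example.

```python
def pearson_x2(y, n, mu, rho):
    """Pearson statistic using variance n*mu*(1 - mu)*[1 + (n - 1)*rho]."""
    return sum(
        (yi - ni * mu) ** 2 / (ni * mu * (1 - mu) * (1 + (ni - 1) * rho))
        for yi, ni in zip(y, n)
    )

def fit_common_probability(y, n, n_cycles=50):
    """Alternate between estimating mu for fixed rho and solving X^2 = df for rho."""
    df = len(y) - 1                    # one fitted parameter (the common mu)
    rho = 0.0
    mu = sum(y) / sum(n)
    for _ in range(n_cycles):
        # Step 1: quasi-likelihood estimate of mu for the current rho.
        weights = [1 / (1 + (ni - 1) * rho) for ni in n]
        mu = (sum(w * yi for w, yi in zip(weights, y))
              / sum(w * ni for w, ni in zip(weights, n)))
        # Step 2: solve X^2(rho) = df; X^2 decreases in rho when all n_i > 1.
        if pearson_x2(y, n, mu, 0.0) <= df:
            rho = 0.0                  # no evidence of overdispersion
            continue
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if pearson_x2(y, n, mu, mid) > df:
                lo = mid
            else:
                hi = mid
        rho = (lo + hi) / 2
    return mu, rho

# Hypothetical litters: sizes and numbers of deaths, clearly overdispersed.
litter_sizes = [10, 8, 12, 9, 11, 10, 7, 13]
deaths = [1, 7, 2, 8, 3, 9, 0, 10]

mu_hat, rho_hat = fit_common_probability(deaths, litter_sizes)
x2_final = pearson_x2(deaths, litter_sizes, mu_hat, rho_hat)
```

At the returned values the Pearson statistic equals its df of 7 by construction. With covariates, step (1) would instead require iteratively reweighted least squares with the same weights.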

An alternative quasi-likelihood approach, presented in Section 4.7.3, uses the simpler
inflated binomial variance function

$$v(\mu_i) = \phi n_i\mu_i(1 - \mu_i). \tag{14.11}$$

The ordinary binomial variance has $\phi = 1$, and overdispersion occurs when $\phi > 1$. With
this approach, $\hat\beta$ is the same as its ML estimate for the ordinary binomial model. Commonly,
$\hat\phi = X^2/\mathrm{df}$, where $X^2$ is the Pearson fit statistic for the binomial model (Finney 1947). The
standard errors for the overdispersion approach multiply those for the binomial model
by $\hat\phi^{1/2}$.

Liang and McCullagh (1993) showed several examples using these two variance functions.
A plot of the standardized residuals for the ordinary binomial model against the
indices $\{n_i\}$ can provide insight about which is more appropriate. When the residuals show
an increasing trend in their spread as $n_i$ increases, the beta-binomial-type variance function
may be more appropriate. This is because when the beta-binomial variance holds,
the residuals from an ordinary binomial model have denominator that is progressively too
small as $n_i$ increases. The two quasi-likelihood approaches are equivalent when $\{n_i\}$ are
identical. Only when the indices vary considerably might results differ much. Because the
variance function $v(\mu_i) = \phi n_i\mu_i(1 - \mu_i)$ has a structural problem when $n_i = 1$ (Section
4.7.3) and has less direct motivation, we prefer quasi-likelihood with the beta-binomial
variance function.


14.3.4  Example:  Teratology  Overdispersion  Revisited 

Table  4.7  showed  results  of  a  teratology  experiment.  Female  rats  on  iron-deficient  diets 
were  assigned  to  four  groups.  Group  1  was  given  only  placebo  injections.  The  other  groups 
were  given  injections  of  an  iron  supplement  according  to  various  schedules.  The  rats  were 
made  pregnant  and  then  sacrificed  after  3  weeks.  For  each  fetus  in  each  rat’s  litter,  the 
response  was  whether  the  fetus  was  dead.  Because  of  unmeasured  covariates,  it  is  natural 
to  permit  the  probability  of  death  to  vary  from  litter  to  litter  within  a  particular  treatment 
group. 

Let $y_i$ denote the number dead out of the $n_i$ fetuses in litter i. Let $\pi_{it}$ denote the probability
of death for fetus t in litter i. We use the model

$$\operatorname{logit}(\pi_{it}) = \alpha + \beta_2 z_{i2} + \beta_3 z_{i3} + \beta_4 z_{i4},$$

where $z_{ig} = 1$ if litter i is in group g and 0 otherwise.

First, suppose that $y_i$ is a bin($n_i$, $\pi_{it}$) variate, independent from litter to litter. This treats
all litters in a group g as having the same probability of death, $\exp(\alpha + \beta_g)/[1 + \exp(\alpha + \beta_g)]$,
where $\beta_1 = 0$. The ML estimates are $\hat\alpha = 1.14$ (SE = 0.13), $\hat\beta_2 = -3.32$ (SE = 0.33),
$\hat\beta_3 = -4.48$ (SE = 0.73), $\hat\beta_4 = -4.13$ (SE = 0.48). However, this binomial ML approach
has evidence of overdispersion, with $X^2 = 154.7$ and $G^2 = 173.5$ (df = 54).

Table 14.5 Estimates for Several Logistic Models Fitted to Table 4.7

                                      Type of Logistic Model (a)
Parameter        Beta-bin. ML     QL(1)            QL(2)            GEE              GLMM
Intercept         1.35 (0.24)      1.21 (0.22)      1.21 (0.27)      1.14 (0.28)      1.80 (0.36)
Group 2          -3.11 (0.52)     -3.37 (0.56)     -3.32 (0.56)     -3.37 (0.43)     -4.51 (0.74)
Group 3          -3.87 (0.86)     -4.59 (1.30)     -4.48 (1.24)     -4.58 (0.62)     -5.86 (1.19)
Group 4          -3.92 (0.68)     -4.25 (0.85)     -4.13 (0.81)     -4.25 (0.60)     -5.59 (0.92)
Overdispersion   $\hat\theta/(1+\hat\theta) = 0.241$   $\hat\rho = 0.192$   $\hat\phi = 2.86$   $\hat\rho = 0.185$   $\hat\sigma = 1.53$

(a) QL is quasi-likelihood with (1) beta-binomial-type variance, (2) inflated binomial variance; GEE uses
exchangeable working correlations. Values in parentheses are standard errors.

By contrast, Table 14.5 shows ML estimates and standard errors for the beta-binomial
model, which permits heterogeneity for litters in a group. The overdispersion results in
inflated SE values compared with binomial ML. For the beta-binomial fit,
$\hat\theta/(1 + \hat\theta) = 0.241$, so the fit treats the variance of $Y_i$ as

$$\text{var}(Y_i) = n_i\mu_i(1 - \mu_i)[1 + 0.241(n_i - 1)].$$

This corresponds roughly to a doubling of the variance relative to the binomial with a litter
size of 5 and a tripling with $n_i = 9$.
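The doubling and tripling quoted above follow directly from the factor $1 + 0.241(n_i - 1)$. A minimal sketch of that arithmetic is given here; the function name `beta_binomial_inflation` is a hypothetical helper introduced only for illustration.

```python
def beta_binomial_inflation(rho: float, n: int) -> float:
    """Ratio of the beta-binomial-type variance to the binomial variance
    for a cluster (litter) of size n: 1 + rho * (n - 1)."""
    return 1.0 + rho * (n - 1)


rho_hat = 0.241  # theta/(1 + theta) from the beta-binomial ML fit quoted in the text

for litter_size in (1, 5, 9):
    factor = beta_binomial_inflation(rho_hat, litter_size)
    print(f"n = {litter_size}: variance inflated by a factor of {factor:.3f}")
```

With a single fetus per litter the factor is 1, so no overdispersion is possible; the factor is about 1.96 at a litter size of 5 and about 2.93 at a litter size of 9.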

Table 14.5 also shows results for the two quasi-likelihood approaches. Estimates and
standard errors are qualitatively similar. For variance function
$v(\mu_i) = \phi n_i \mu_i(1 - \mu_i)$, the estimates equal the binomial ML estimates but SE
values are multiplied by $\hat\phi^{1/2} = \sqrt{X^2/\text{df}} = \sqrt{154.7/54} = 1.69$.
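The rescaling of standard errors under the inflated binomial variance can be sketched as follows. The Pearson statistic, its df, and the binomial ML standard errors for the three group effects are the values quoted in the text; the dictionary keys are labels chosen here for illustration.

```python
import math

x2, df = 154.7, 54           # Pearson statistic and df from the binomial ML fit
phi_hat = x2 / df            # estimated dispersion parameter
scale = math.sqrt(phi_hat)   # factor applied to the binomial ML standard errors

# Binomial ML standard errors for the group effects, as reported in the text.
binomial_se = {"group2": 0.33, "group3": 0.73, "group4": 0.48}
ql_se = {name: round(se * scale, 2) for name, se in binomial_se.items()}

print(f"phi_hat = {phi_hat:.2f}, SE multiplier = {scale:.2f}")
print(ql_se)
```

The rescaled values for the three group effects agree with the QL(2) standard errors listed in Table 14.5 (0.56, 1.24, 0.81).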

Figure 14.4 plots the standardized residuals against litter size for the binomial logit
model. The apparent increase in their variability as litter size increases suggests that the
beta-binomial variance function is plausible. For that variance function, the probabilities of
death for litters of a particular group have standard deviation $\sqrt{\rho\mu(1 - \mu)}$, where in
the beta-binomial distribution $\rho$ corresponds to $\theta/(1 + \theta)$. For the QL fit,
$\hat\rho = 0.192$, so this standard deviation equals 0.22 when the mean is 0.50 and 0.13 when the
mean is 0.10 or 0.90. This is considerable heterogeneity. More generally, a model could let
$\rho$ vary by treatment group or be different for the placebo group from the others. We leave
this to the reader.
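The standard deviations just quoted can be reproduced with a few lines. This is only a sketch of the formula $\sqrt{\rho\mu(1-\mu)}$; the function name is a hypothetical helper.

```python
import math


def litter_probability_sd(rho: float, mu: float) -> float:
    """Standard deviation of litter-level probabilities with mean mu
    under the beta-binomial-type variance: sqrt(rho * mu * (1 - mu))."""
    return math.sqrt(rho * mu * (1.0 - mu))


rho_hat = 0.192  # quasi-likelihood estimate quoted in the text

for mu in (0.10, 0.50, 0.90):
    print(f"mean {mu:.2f}: sd = {litter_probability_sd(rho_hat, mu):.2f}")
```

The function is symmetric about a mean of 0.50, which is why the means 0.10 and 0.90 give the same standard deviation.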

For comparison, Table 14.5 also shows results with the GEE approach to fitting the
logistic model, assuming exchangeable working correlation structure for observations within
a litter. The empirical sandwich adjustment increases the SE values compared with binomial
ML. The estimated within-litter correlation between the binary responses is 0.185.
This is comparable to the value of 0.192 that yields the quasi-likelihood results with
beta-binomial variance function. The GEE standard errors are somewhat different from those
with the quasi-likelihood approach. It may be that the sample size is insufficient for the
GEE sandwich adjustment, which tends to underestimate standard errors unless the number
of clusters is quite large. Or, this may merely reflect the different variance function for the
GEE approach.

Finally, Table 14.5 shows results for the GLMM that adds a normal random intercept $u_i$
for litter $i$ to the binomial logistic model. Estimated effects are larger for this logistic-normal
model, since they are cluster-specific (for cluster = litter) rather than population-averaged.
Even with all these adjustments for overdispersion, Table 14.5 shows that strong evidence
remains that the probability of death is substantially lower for each treatment group than
the placebo group.

Figure 14.4  Standardized Pearson residuals for binomial logistic model fitted to Table 4.7.

14.3.5  Conjugate  Mixture  Models 

The  beta-binomial  model  is  an  example  of  a  conjugate  mixture  model.  These  are  models 
for  which  the  marginal  distribution  has  closed  form.  The  data  have  a  particular  distribution, 
conditional  on  a  parameter,  and  then  the  parameter  has  its  own  distribution  such  that  the 
marginal  distribution  has  closed  form. 

Likewise,  from  Section  1.6.2,  in  Bayesian  methods  the  conjugate  prior  distribution  is  a 
distribution  that,  when  combined  with  the  likelihood,  gives  a  closed  form  for  the  posterior 
distribution.  For  instance,  for  binomial  observations  with  beta  prior  distribution  for  the 
parameter,  the  posterior  distribution  is  also  beta. 

Next,  we  present  a  conjugate  mixture  model  for  count  data.  It  uses  a  gamma  distribution 
to  mix  the  Poisson  parameter.  A  disadvantage  of  the  conjugate  mixture  approach  is  the 
lack  of  generality  and  flexibility,  requiring  a  different  mixture  distribution  for  each  type  of 
problem.  In  addition,  the  extra  variability  need  not  enter  on  the  same  scale  as  the  ordinary 
predictors,  and  it  can  be  difficult  to  have  multivariate  random  effects  structure.  Lee  et  al. 
(2006)  discussed  the  conjugate  approach  and  discussed  a  variety  of  hierarchical  models  of 
GLMM  form  in  which  the  random  effects  need  not  be  normal. 

14.4  NEGATIVE  BINOMIAL  REGRESSION 

The  negative  binomial  is  a  conjugate  mixture  distribution  for  count  data.  It  is  useful  when 
overdispersion  occurs  with  Poisson  GLMs. 




14.4.1  Gamma  Mixture  of  Poissons  Is  Negative  Binomial 

A  severe  limitation  of  Poisson  models  is  that  the  variance  of  Y  must  equal  the  mean  (Section 
4.3.3).  Hence,  at  a  fixed  mean  the  variance  cannot  decrease  as  additional  predictors  enter 
the  model.  Count  data  often  show  overdispersion,  with  the  variance  exceeding  the  mean. 
This  might  happen,  for  instance,  because  some  relevant  explanatory  variables  are  not  in  the 
model.  A  mixture  model  is  a  flexible  way  to  account  for  overdispersion.  At  a  fixed  setting 
of  the  predictors  used,  given  the  mean  the  distribution  of  Y  is  Poisson,  but  the  mean  itself 
varies  according  to  some  distribution. 

Suppose that (1) given $\lambda$, $Y$ has a Poisson distribution with mean $\lambda$, and (2)
$\lambda$ has a gamma distribution, $G(k, \mu)$. The gamma probability density function for
$\lambda$ is

$$f(\lambda; k, \mu) = \frac{(k/\mu)^k}{\Gamma(k)} \exp(-k\lambda/\mu)\,\lambda^{k-1}, \quad \lambda \ge 0. \tag{14.12}$$

This gamma distribution has

$$E(\lambda) = \mu, \quad \text{var}(\lambda) = \mu^2/k.$$

The parameter $k > 0$ describes the shape. The density is skewed to the right, but the degree
of skewness (which equals $2/\sqrt{k}$) decreases as $k$ increases.

Marginally, the gamma mixture of the Poisson distributions yields the negative binomial
distribution for $Y$ (Greenwood and Yule 1920). Its probability mass function is

$$p(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k)\Gamma(y + 1)} \left(\frac{k}{\mu + k}\right)^{k} \left(1 - \frac{k}{\mu + k}\right)^{y}, \quad y = 0, 1, 2, \ldots. \tag{14.13}$$

In terms of the dispersion parameter $\gamma = 1/k$,

$$E(Y) = \mu, \quad \text{var}(Y) = \mu + \gamma\mu^2.$$

The greater $\gamma$, the greater the overdispersion relative to the Poisson. As $\gamma \to 0$,
the negative binomial distribution has $\text{var}(Y) \to \mu$ and it converges to the Poisson
distribution with mean $\mu$.
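The moment formulas and the Poisson limit can be checked numerically. The sketch below assumes the standard gamma-mixture form of the negative binomial probability mass function with mean $\mu$ and shape $k = 1/\gamma$; the function names and the truncation point `upper` are choices made here for illustration, not part of the text.

```python
import math


def neg_binomial_pmf(y: int, k: float, mu: float) -> float:
    """Negative binomial pmf with mean mu > 0 and shape k > 0 (dispersion 1/k),
    evaluated on the log scale for numerical stability."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (mu + k))
             + y * math.log(mu / (mu + k)))
    return math.exp(log_p)


def moments(k: float, mu: float, upper: int = 2000):
    """Total probability, mean, and variance from a truncated sum over y."""
    probs = [neg_binomial_pmf(y, k, mu) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


k, mu = 0.5, 2.0                 # dispersion gamma = 1/k = 2
total, mean, var = moments(k, mu)
print(total, mean, var)          # variance should be close to mu + mu**2 / k

# As k grows (gamma -> 0) the pmf approaches the Poisson pmf with the same mean.
poisson_3 = math.exp(-mu) * mu ** 3 / math.factorial(3)
print(neg_binomial_pmf(3, 1e6, mu), poisson_3)
```

With $k = 0.5$ and $\mu = 2$ the variance formula gives $2 + 2(2^2) = 10$, five times the Poisson variance at the same mean.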

The negative binomial has much greater scope than the Poisson. For example, the Poisson
mode is the integer part of the mean and thus equals 0 only when $\mu < 1$. The negative
binomial mode is the integer part of $\mu(k - 1)/k$ (Johnson et al. 2005, p. 217) and can be 0
for any $\mu$.

For independent observations from a negative binomial distribution, the ML estimate of
$\mu$ is the sample mean, but ML estimation for $\gamma$ requires iterative methods (R. A. Fisher
showed this in an appendix of a 1953 Biometrics article by C. Bliss). An alternative gamma
parameterization implies a linear rather than quadratic variance function for the negative
binomial (Exercise 14.29).


14.4.2  Negative  Binomial  Regression  Modeling 

Negative binomial models for counts permit $\mu$ to depend on explanatory variables. Such
models normally take $\gamma$ to be the same for all observations. This corresponds to a constant
coefficient of variation in the gamma mixing distribution,
$\sqrt{\text{var}(\lambda)}/E(\lambda) = \sqrt{\gamma}$, with the standard deviation increasing as
the mean does. Most common is the log link, as in Poisson loglinear models. Sometimes the
identity link is adequate, such as with a single predictor that is a factor.

For $\gamma$ fixed, a negative binomial model is a GLM. The likelihood equations for the
regression parameters $\boldsymbol{\beta}$ are then special cases of those [see (4.25)] for an
ordinary GLM with variance function $v(\mu) = \mu + \gamma\mu^2$. The usual iterative reweighted
least-squares algorithm applies for ML model fitting. The full log likelihood
$L(\boldsymbol{\beta}, \gamma; \mathbf{y})$ for a negative binomial model with link function $g$
satisfies

$$\frac{\partial^2 L}{\partial \beta_j \, \partial \gamma} = -\sum_i \frac{(y_i - \mu_i)\,x_{ij}}{(1 + \gamma\mu_i)^2\, g'(\mu_i)}.$$

Thus, $E(\partial^2 L/\partial \beta_j \, \partial \gamma) = 0$ for each $j$. Similarly, the inverse of
the expected information matrix has 0 elements connecting $\gamma$ with each $\beta_j$. Since this
is the asymptotic covariance matrix, $\hat{\boldsymbol{\beta}}$ and $\hat\gamma$ are asymptotically
independent.
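For readers who want the intermediate step, the mixed partial derivative can be obtained from the GLM score equations. The following is a sketch under the assumption that the score for $\beta_j$ has the usual GLM form with variance function $v(\mu_i) = \mu_i + \gamma\mu_i^2$, and that $\mu_i$ depends on $\boldsymbol{\beta}$ but not on $\gamma$.

```latex
\begin{align*}
\frac{\partial L}{\partial \beta_j}
  &= \sum_i \frac{(y_i - \mu_i)\,x_{ij}}{v(\mu_i)\, g'(\mu_i)},
  \qquad v(\mu_i) = \mu_i + \gamma\mu_i^2 = \mu_i(1 + \gamma\mu_i), \\[4pt]
\frac{\partial}{\partial \gamma}\left[\frac{1}{v(\mu_i)}\right]
  &= -\frac{\mu_i^2}{\mu_i^2 (1 + \gamma\mu_i)^2}
   = -\frac{1}{(1 + \gamma\mu_i)^2}, \\[4pt]
\frac{\partial^2 L}{\partial \beta_j\, \partial \gamma}
  &= -\sum_i \frac{(y_i - \mu_i)\,x_{ij}}{(1 + \gamma\mu_i)^2\, g'(\mu_i)}, \\[4pt]
E\!\left(\frac{\partial^2 L}{\partial \beta_j\, \partial \gamma}\right)
  &= -\sum_i \frac{[E(Y_i) - \mu_i]\,x_{ij}}{(1 + \gamma\mu_i)^2\, g'(\mu_i)} = 0 .
\end{align*}
```

The expectation vanishes because each term is linear in $y_i - \mu_i$, which has mean zero whenever the model for the mean is correct.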

14.4.3  Example:  Frequency  of  Knowing  Homicide  Victims 

Table  14.6  summarizes  responses  of  1308  subjects  to  the  question:  Within  the  past  12 
months,  how  many  people  have  you  known  personally  that  were  victims  of  homicide?  The 
table  shows  responses  by  race,  for  those  who  identified  their  race  as  white  or  as  black.  The 
sample  mean  for  the  159  blacks  was  0.522,  with  a  variance  of  1.150.  The  sample  mean  for 
the 1149 whites was 0.092, with a variance of 0.155.

A natural first choice for modeling count data is a Poisson GLM, such as a loglinear
model with an indicator predictor for race. Let $y_{it}$ denote the response for subject $t$ of
race $i$. For $\mu_{it} = E(Y_{it})$, this model is

$$\log \mu_{it} = \alpha + \beta x_{it},$$

with $x_{1t} = 1$ (blacks) and $x_{2t} = 0$ (whites). This model has fit
$\log \hat\mu_{it} = -2.38 + 1.733x_{it}$. The estimated expected responses are
$\exp(-2.38 + 1.733) = 0.522$ for blacks and $\exp(-2.38) = 0.092$ for whites, the sample means.
For any link function for this model, the likelihood equations imply that the fitted means equal
the sample means. Since $\hat\beta = 1.733$ (SE = 0.147) is the difference between the log means
for blacks and whites, the ratio of sample means is $\exp(1.733) = 5.7 = 0.522/0.092$. Table 14.6
also shows the fit of this model.

Table 14.6  Number of Victims of Murder Known in Past Year, by Race,
with Fit of Poisson and Negative Binomial Models

                 Data           Poisson GLM       Neg. Bin. GLM     Poisson GLMM
Response     Black   White     Black    White     Black    White     Black    White
0              119    1070      94.3   1047.7     122.8   1064.9     116.7   1068.3
1               16      60      49.2     96.7      17.9     67.5      24.5     65.3
2               12      14      12.9      4.5       7.8     12.7       8.1     10.1
3                7       4       2.2      0.1       4.1      2.9       3.6      2.8
4                3       0       0.3      0.0       2.4      0.7       1.9      1.1
5                2       0       0.0      0.0       1.4      0.2       1.1      0.5
6                0       1       0.0      0.0       0.9      0.1       0.7      0.3

Source: 1990 General Social Survey.

Table 14.7  Parameter Estimates for Models Fitted to Homicide Data

                         Models with Log Link                 Models with Identity Link
                   Neg. Binomial   Poisson   Poisson          Neg. Binomial   Poisson
Term               GLM             GLM       GLMM             GLM             GLM
$\alpha$           -2.38           -2.38     -3.69            0.092           0.092
$\beta$            1.733           1.733     1.897            0.430           0.430
SE($\hat\beta$)    0.238           0.147     0.246            0.109           0.058
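The sample summaries and the Poisson fit can be reproduced from the observed counts in Table 14.6. The sketch below uses only the standard library. The closed-form standard error $\sqrt{1/\sum y_{\text{black}} + 1/\sum y_{\text{white}}}$ is a standard large-sample result for the log ratio of two Poisson means and is an assumption added here, not a formula stated in the text.

```python
import math

# Observed frequencies from Table 14.6; list index = number of victims known (0..6).
counts = {
    "black": [119, 16, 12, 7, 3, 2, 0],
    "white": [1070, 60, 14, 4, 0, 0, 1],
}


def summarize(freq):
    """Return (n, total count, sample mean, sample variance) for grouped data."""
    n = sum(freq)
    total = sum(y * f for y, f in enumerate(freq))
    mean = total / n
    var = sum(f * (y - mean) ** 2 for y, f in enumerate(freq)) / (n - 1)
    return n, total, mean, var


summary = {race: summarize(freq) for race, freq in counts.items()}

# With a single binary predictor, the Poisson ML fitted means are the sample means.
alpha_hat = math.log(summary["white"][2])
beta_hat = math.log(summary["black"][2]) - alpha_hat
se_beta = math.sqrt(1 / summary["black"][1] + 1 / summary["white"][1])


def poisson_expected(n, mu, y):
    """Expected frequency of response y among n subjects with Poisson mean mu."""
    return n * math.exp(-mu) * mu ** y / math.factorial(y)


for race, (n, total, mean, var) in summary.items():
    print(race, n, round(mean, 3), round(var, 3),
          [round(poisson_expected(n, mean, y), 1) for y in range(3)])
print(round(alpha_hat, 2), round(beta_hat, 3), round(se_beta, 3))
```

The expected frequencies printed for responses 0 to 2 can be compared with the Poisson GLM columns of Table 14.6.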

However, the data show evidence of overdispersion for a Poisson GLM, as for each race
the sample variance is roughly double the mean. This evidence is reflected by the higher
observed counts at $y = 0$ and at large $y$ values than the Poisson GLM predicts. A negative
binomial mixture model seems plausible. Due to demographic factors, heterogeneity
probably occurs among subjects of a given race in the distribution of $Y$. For ML fitting,
the deviance decreases by 122.2 compared with the ordinary Poisson GLM that is the
special case with $\gamma = 0$. Table 14.6 also shows this model fit. It is dramatically better at
$y = 0$ and 1.

Table 14.7 shows parameter estimates for the negative binomial and Poisson GLMs.
For both, $\hat\beta = 1.733$ since both models provide fitted means equal to the sample means.
However, the estimated standard error of $\hat\beta$ increases from 0.147 for the Poisson GLM to
0.238 for the negative binomial GLM. The Wald 95% confidence interval for the ratio of
means for blacks and whites is $\exp[1.733 \pm 1.96(0.147)] = (4.2, 7.5)$ for the Poisson GLM
but $\exp[1.733 \pm 1.96(0.238)] = (3.5, 9.0)$ for the negative binomial GLM. In accounting
for the overdispersion, we obtain results that are not as precise as the more naive model
suggests but are more credible.
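The two Wald intervals are simple to recompute from the reported estimate and standard errors. A minimal sketch follows; the helper name `wald_ratio_interval` is hypothetical.

```python
import math


def wald_ratio_interval(beta_hat: float, se: float, z: float = 1.96):
    """Wald interval for a ratio of means, exp(beta_hat +/- z * se)."""
    return math.exp(beta_hat - z * se), math.exp(beta_hat + z * se)


beta_hat = 1.733  # same estimate under both models
for model, se in (("Poisson GLM", 0.147), ("negative binomial GLM", 0.238)):
    lower, upper = wald_ratio_interval(beta_hat, se)
    print(f"{model}: ({lower:.1f}, {upper:.1f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric about the estimated ratio of 5.7.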

The negative binomial model has $\hat\gamma = 4.94$ (SE = 1.00). This shows strong evidence
that $\gamma > 0$, indicating that the negative binomial model is more appropriate than the
Poisson GLM. The estimated variance of $Y$ is $\hat\mu + \hat\gamma\hat\mu^2 = \hat\mu + 4.94\hat\mu^2$,
which is 0.13 for whites and 1.87 for blacks, much closer to the sample values than the Poisson
model provides.
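As a quick check of these fitted variances, the quadratic variance function can be evaluated at the two fitted means. This sketch uses the rounded estimates quoted in the text.

```python
gamma_hat = 4.94                               # estimated dispersion parameter
fitted_means = {"white": 0.092, "black": 0.522}  # fitted means equal the sample means

# Negative binomial variance function: mu + gamma * mu^2
fitted_variances = {race: mu + gamma_hat * mu ** 2 for race, mu in fitted_means.items()}

for race, variance in fitted_variances.items():
    print(f"{race}: fitted variance = {variance:.2f}")
```

For comparison, the sample variances reported earlier are 0.155 for whites and 1.150 for blacks, whereas a Poisson model forces the variances to equal the means.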

Table  14.7  also  shows  results  for  negative  binomial  and  Poisson  models  using  the  identity 
link.  Again,  the  fits  reproduce  the  sample  means  and  are  more  imprecise  but  more  credible 
with  the  negative  binomial  model.  For  this  link  also  the  estimated  dispersion  parameter  is 
$\hat\gamma = 4.94$.


14.5  POISSON REGRESSION WITH RANDOM EFFECTS

We've seen that a flexible way to account for overdispersion is with a mixture model. We've
just seen that mixing the Poisson using the gamma distribution yields the negative binomial
marginally. An alternative mixes the Poisson log mean with a normal random effect.

14.5.1  A Poisson GLMM

Breslow (1984) and Hinde (1982) suggested the GLMM structure (13.1) with the log link
and normal random intercept. The model for the mean for observation $t$ in cluster $i$ is

$$\log[E(Y_{it} \mid u_i)] = \mathbf{x}_{it}^T\boldsymbol{\beta} + u_i, \tag{14.14}$$

where $\{u_i\}$ are independent $N(0, \sigma^2)$. Conditional on $u_i$, $y_{it}$ has a Poisson
distribution. Marginally, the distribution has variance greater than the mean whenever
$\sigma > 0$. The identity link is also possible but has a structural problem: When $\sigma > 0$,
a positive probability exists that the linear predictor is negative.

The negative binomial model (for fixed $\gamma$) is a GLMM with nonnormal random effect.
With the log link, it results from a loglinear model of form (14.14) with random intercept,
where $\exp(u_i)$ has a gamma distribution with mean 1 and variance $\gamma$.

14.5.2  Marginal  Model  Implied  by  Poisson  GLMM 

The Poisson GLMM (14.14) implies a relatively simple marginal model, averaging out the
random effect. The mean of the marginal distribution is

$$E(Y_{it}) = E[E(Y_{it} \mid u_i)] = E\left[e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + u_i}\right] = e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + \sigma^2/2}.$$

Here $E[\exp(u_i)] = \exp(\sigma^2/2)$ because a $N(0, \sigma^2)$ variate $u_i$ has moment generating
function $E[\exp(tu_i)] = \exp(t^2\sigma^2/2)$. So, for the Poisson GLMM the log of the mean
conditionally equals $\mathbf{x}_{it}^T\boldsymbol{\beta} + u_i$ and marginally equals
$\mathbf{x}_{it}^T\boldsymbol{\beta} + \sigma^2/2$. A loglinear model still applies.
The marginal effects of the explanatory variables are the same as the cluster-specific
effects. Thus, the ratio of means at two different settings of $\mathbf{x}_{it}$ is the same
conditionally and marginally. However, marginally the intercept is offset. (Note that Jensen's
inequality applies, since the link is not linear.)

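The variance of the marginal distribution is

$$\begin{aligned}
\text{var}(Y_{it}) &= E[\text{var}(Y_{it} \mid u_i)] + \text{var}[E(Y_{it} \mid u_i)]
  = E\left[e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + u_i}\right] + e^{2\mathbf{x}_{it}^T\boldsymbol{\beta}}\,\text{var}(e^{u_i}) \\
&= e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + \sigma^2/2} + e^{2\mathbf{x}_{it}^T\boldsymbol{\beta}}\left(e^{2\sigma^2} - e^{\sigma^2}\right)
  = E(Y_{it}) + [E(Y_{it})]^2\left(e^{\sigma^2} - 1\right).
\end{aligned}$$

Here, $\text{var}(e^{u_i}) = E(e^{2u_i}) - [E(e^{u_i})]^2 = e^{2\sigma^2} - e^{\sigma^2}$ by evaluating
the moment generating function at $t = 2$ and $t = 1$. As in the negative binomial model, the
marginal variance is a quadratic function of the marginal mean. It exceeds the marginal mean when
$\sigma > 0$. The ordinary Poisson model results when $\sigma = 0$. When $\sigma > 0$ the marginal
distribution is not Poisson, and the extent to which the variance exceeds the mean increases as
$\sigma$ increases.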
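The closed forms for the marginal mean and variance can be checked against direct numerical integration over the random effect. The sketch below uses a simple trapezoidal rule; the function names, the integration range, and the trial values of the linear predictor `eta` and of `sigma` are assumptions chosen for illustration.

```python
import math


def normal_pdf(u: float, sigma: float) -> float:
    """Density of N(0, sigma^2) at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_moments_numeric(eta, sigma, half_width=12.0, steps=24000):
    """Marginal mean and variance of Y, integrating over u ~ N(0, sigma^2)
    with the trapezoidal rule on [-half_width*sigma, half_width*sigma]."""
    lo = -half_width * sigma
    h = 2.0 * half_width * sigma / steps
    m1 = m2 = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        weight = h * (0.5 if i in (0, steps) else 1.0) * normal_pdf(u, sigma)
        cond_mean = math.exp(eta + u)   # E(Y | u) = var(Y | u) for the Poisson
        m1 += weight * cond_mean
        m2 += weight * cond_mean ** 2
    # var(Y) = E[var(Y | u)] + var[E(Y | u)]
    return m1, m1 + (m2 - m1 ** 2)


def marginal_moments_closed_form(eta, sigma):
    """Closed-form marginal mean and variance for the Poisson GLMM."""
    mean = math.exp(eta + sigma ** 2 / 2.0)
    return mean, mean + mean ** 2 * (math.exp(sigma ** 2) - 1.0)


eta, sigma = 0.5, 1.0
print(marginal_moments_numeric(eta, sigma))
print(marginal_moments_closed_form(eta, sigma))
```

The two pairs of values should agree to many decimal places, and the variance exceeds the mean because $\sigma > 0$.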

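As in binary GLMMs, $Y_{it}$ and $Y_{is}$ are independent given $u_i$ but are marginally
nonnegatively correlated. For $t \ne s$,

$$\begin{aligned}
\text{cov}(Y_{it}, Y_{is}) &= E[\text{cov}(Y_{it}, Y_{is} \mid u_i)] + \text{cov}[E(Y_{it} \mid u_i), E(Y_{is} \mid u_i)] \\
&= 0 + \text{cov}\left[\exp(\mathbf{x}_{it}^T\boldsymbol{\beta} + u_i), \exp(\mathbf{x}_{is}^T\boldsymbol{\beta} + u_i)\right].
\end{aligned} \tag{14.15}$$

The functions in the last covariance term are both monotone increasing functions of $u_i$, and
hence are nonnegatively correlated (Exercise 14.33).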
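For the normal random intercept, the covariance can also be written in closed form, which makes the nonnegativity explicit. The following sketch factors the fixed parts out of the covariance and uses the lognormal variance $\text{var}(e^{u_i}) = e^{2\sigma^2} - e^{\sigma^2}$; it is an added illustration rather than a step given in the text.

```latex
\begin{align*}
\operatorname{cov}(Y_{it}, Y_{is})
  &= e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + \mathbf{x}_{is}^T\boldsymbol{\beta}}\,
     \operatorname{var}\!\left(e^{u_i}\right)
   = e^{\mathbf{x}_{it}^T\boldsymbol{\beta} + \mathbf{x}_{is}^T\boldsymbol{\beta}}
     \left(e^{2\sigma^2} - e^{\sigma^2}\right) \\
  &= E(Y_{it})\,E(Y_{is})\left(e^{\sigma^2} - 1\right) \;\ge\; 0,
  \qquad t \ne s .
\end{align*}
```

The covariance is 0 only when $\sigma = 0$, the ordinary Poisson case, and it grows with both marginal means.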


14.5.3  Example: Homicide Victim Frequency Revisited

For Table 14.6 on responses of the number of known victims of homicide within the past
12 months, models permitting subject heterogeneity are sensible. For the response $y_{it}$ for
subject $t$ of race $i$, the Poisson GLMM is

$$\log[E(Y_{it} \mid u_{it})] = \alpha + \beta x_{it} + u_{it},$$

where $\{u_{it}\}$ are independent $N(0, \sigma^2)$. The log means vary according to a
$N(\alpha, \sigma^2)$ distribution for whites and a $N(\alpha + \beta, \sigma^2)$ distribution for
blacks. Given $u_{it}$, $y_{it}$ has a Poisson distribution.

Table 14.6 also shows this model fit, and Table 14.7 shows estimates. The random effects
have $\hat\sigma = 1.63$ (SE = 0.15). The deviance decreases by 116.6 compared with the Poisson
GLM, indicating a better fit by allowing heterogeneity. For subjects at the means of the
random effects distributions ($u_{it} = 0$) the estimated expected responses are
$\exp(-3.69 + 1.90) = 0.167$ for blacks and $\exp(-3.69) = 0.025$ for whites. The fitted marginal
mean is $\exp(\hat\alpha + \hat\beta x_{it} + \hat\sigma^2/2)$, or 0.63 for blacks and 0.09 for
whites. The fitted marginal variances are 5.78 for blacks and 0.21 for whites. These are
somewhat larger than the sample means and variances, perhaps because the fitted distribution has
nonnegligible mass above the largest observed response of 6.
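The fitted conditional and marginal summaries can be approximated from the rounded estimates in Table 14.7 together with $\hat\sigma = 1.63$. This is a sketch only: because the inputs are rounded, the results need not match the reported values to the last digit, and the function name is a hypothetical helper.

```python
import math

# Rounded Poisson GLMM estimates (log link) from Table 14.7 and the text.
alpha_hat, beta_hat, sigma_hat = -3.69, 1.897, 1.63


def fitted_summaries(x: int):
    """Return (mean at u = 0, marginal mean, marginal variance) for race indicator x."""
    eta = alpha_hat + beta_hat * x
    mean_at_zero = math.exp(eta)
    marginal_mean = math.exp(eta + sigma_hat ** 2 / 2.0)
    marginal_var = marginal_mean + marginal_mean ** 2 * (math.exp(sigma_hat ** 2) - 1.0)
    return mean_at_zero, marginal_mean, marginal_var


for race, x in (("black", 1), ("white", 0)):
    at_zero, mean, var = fitted_summaries(x)
    print(f"{race}: mean at u=0 {at_zero:.3f}, marginal mean {mean:.2f}, "
          f"marginal variance {var:.2f}")
```

With these rounded inputs the marginal variance comes out near 0.21 for whites and between 5.8 and 5.9 for blacks; the small gap from the reported 5.78 plausibly reflects rounding of the estimates.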


14.5.4  Negative  Binomial  Models  versus  Poisson  GLMMs 

The  Poisson  GLMM  with  normal  random  effects  has  the  advantage,  relative  to  the  negative 
binomial  GLM,  of  easily  permitting  multivariate  random  effects  and  multilevel  models. 
However,  the  negative  binomial  has  properties  that  can  make  interpretation  simpler.  We've 
seen  that  the  identity  link  is  valid  for  it,  which  is  useful  for  simple  examples  such  as  the 
preceding  one  with  a  factor  predictor.  With  any  link  and  a  factor  predictor,  its  ML  fitted 
means  equal  the  sample  means.  This  is  not  the  case  for  the  Poisson  GLMM. 


NOTES 

Section  14.1:  Latent  Class  Models 

14.1  Latent variables: For fitting and interpretation of latent class and related latent variable
models, see Aitkin et al. (1981), Bartholomew et al. (2011), Clogg (1995), Clogg and Goodman
(1984), Collins and Lanza (2009), Goodman (1974), Haberman (1979, Chap. 10), Hagenaars
and McCutcheon (2009), Heinen (1996), Lazarsfeld and Henry (1968), Magidson and
Vermunt (2004), Skrondal and Rabe-Hesketh (2004), and Vermunt (2003). For a similar
mixed-membership model, each subject has partial membership in various classes, with a
distribution specifying a probability for membership in a class (Erosheva et al. 2007). Espeland
and Handelman (1989), Uebersax (1993), Uebersax and Grove (1990, 1993), and Yang and
Becker (1997) presented latent variable models for rater agreement and diagnostic accuracy.

14.2  Mixture goodness of fit: Rudas et al. (1994) proposed a clever mixture method for
summarizing goodness of fit. For a model $M$ for a contingency table with true probabilities
$\pi$, they used the mixture $\pi = (1 - \rho)\pi_1 + \rho\pi_2$, with $\pi_1$ the model-based
probabilities and $\pi_2$ unconstrained. Their index of lack of fit is the smallest such $\rho$
possible for which this holds. It is the fraction of the population that cannot be described by
the model. This recognizes that any given model does not truly hold but is useful if $\rho$ is
close to 0.


Section 14.2: Nonparametric Random Effects Models

14.3  Rasch and QS: For connections between Rasch-type models and quasi-symmetry models, see
Agresti (1993, 1997), Conaway (1989), Darroch (1981), Darroch et al. (1993), and Kelderman
(1984).


Section  14.3:  Beta-Binomial  Models 

14.4  Beta-binomial references: Skellam (1948) introduced the beta-binomial distribution. For
modeling using this distribution or related quasi-likelihood approaches, see Albert (2010),
Brooks et al. (1997), Capanu and Presnell (2008), Crowder (1978), Hinde and Demetrio
(1998), Lee et al. (2006), Liang and Hanfelt (1994), Liang and McCullagh (1993), Lindsey
and Altham (1998), Moore (1986a), Moore and Tsiatis (1991), Nelder and Pregibon (1987),
Prentice (1986), Rosner (1984, 1989) [with critique by Neuhaus and Jewell (1990a)], Slaton
et al. (2000), and Williams (1975, 1982). For beta-binomial type variance, Ryan (1995)
and Williams (1988) showed advantages of the quasi-likelihood approach over ML. The
beta-binomial generalizes to a Dirichlet-multinomial: Conditional on the probabilities, the
distribution is multinomial, and the probabilities themselves have a Dirichlet distribution. See
Brier (1980), Guimaraes (2005), Guimaraes and Lindrooth (2007), Mosimann (1962), Paul
et al. (1989), and Exercise 14.30.

14.5  Developmental toxicity: For modeling overdispersion caused by litter effects in
developmental toxicity studies with binary data, see Follman and Lambert (1989), Kupper and
Haseman (1978), Kupper et al. (1986), Lefkopoulou et al. (1989), and Ryan (1992). Ochi and
Prentice (1984) proposed a probit model based on an underlying normal latent variable model with
common pairwise correlations.


Section  14.4:  Negative  Binomial  Regression 

14.6  NB modeling: Johnson et al. (2005, Chap. 5) summarized properties of the negative binomial
distribution. Cameron and Trivedi (1998, p. 72) showed the asymptotic covariance matrix of
model parameter estimates. They and Lawless (1987) considered a moment estimator for $\gamma$
and studied robustness properties. They noted that $\hat{\boldsymbol{\beta}}$ is consistent if the
model for the mean is correctly specified, even if the true distribution is not negative binomial.
Booth et al. (2003), Hilbe (2011), and Hinde and Demetrio (1998) also discussed NB modeling.


Section  14.5:  Poisson  Regression  with  Random  Effects 

14.7  Zero-inflated  models:  Overdispersion  relative  to  the  Poisson  distribution  often  occurs  when 
the  frequency  of  0  outcomes  is  larger  than  expected.  One  way  to  deal  with  this  is  a  mixture 
model  that  mixes  a  distribution  that  is  degenerate  at  0  with  an  ordinary  Poisson  (or  negative 
binomial)  distribution.  See  Min  and  Agresti  (2005)  for  details  and  references. 


EXERCISES 

Applications 

14.1  Create a $2^5$ table of opinions about legalized abortion by downloading the table
for the items labeled (ABRAPE, ABHLTH, ABSINGLE, ABDEFECT, ABPOOR)
in the most recent GSS. Fit a latent class model. For each latent class, find the
estimated probability of supporting legalized abortion in the five situations. Suggest
a tentative interpretation for the classes.

14.2  Fit a logistic-normal random effects model to the carcinoma ratings of Table 14.1.
Compare results to those for latent class models in Section 14.1.3.

14.3  For capture-recapture experiments, Coull and Agresti (1999) used a quasi-symmetric
loglinear model with no higher-order terms,

$$\log \mu(y_1, \ldots, y_T) = \lambda + \beta_1 y_1 + \cdots + \beta_T y_T + \beta(y_1 y_2 + y_1 y_3 + \cdots + y_{T-1} y_T).$$

Show that (a) like the logistic-normal GLMM, this model has exchangeable association
and only one more parameter than the mutual independence model, (b) the
fit to Table 13.6 yields $\hat N = 90.5$ and a 95% profile-likelihood confidence interval
for $N$ of (75, 125).

14.4  A data set on pregnancy rates among girls under 18 years of age in 13 north central
Florida counties has information on a 3-year total for each county $i$ on $n_i$ = number
of births and $y_i$ = number of those for which the mother's age was under 18 (see J.
Booth, in Statistical Modelling: Lecture Notes in Statistics, 104, Springer, 43-52,
1995).

a.  For a beta-binomial model, the ML estimated parameters are $\hat\alpha_1 = 9.9$ and
$\hat\alpha_2 = 240.8$. Use the mean and variance to describe the estimated beta distribution
and the estimated marginal distribution of $Y_i$ (as a function of $n_i$).

b.  Quasi-likelihood using variance function (14.10) for the model logit($\mu_i$) =
$\alpha$ has $\hat\alpha = -3.18$ and $\hat\rho = 0.005$. Describe the estimated mean and variance
of $Y_i$.

c.  Quasi-likelihood using variance (14.11) for the model logit($\mu_i$) = $\alpha$ has
$\hat\alpha = -3.35$ and $\hat\phi = 8.3$. Describe the estimated mean and variance of $Y_i$.

d.  The logistic-normal GLMM, logit($\pi_i$) = $\alpha + u_i$, yields $\hat\alpha = -3.24$ and
$\hat\sigma = 0.33$. Describe the estimated mean of $Y_i$ [Recall (13.9)].

14.5  In Exercise 13.2 about Ray Allen's three-point shooting, the simple binomial model,
$\pi_i = \alpha$, has lack of fit. Fit the beta-binomial model or use the quasi-likelihood
approach with that variance structure. Use the fit to summarize his free-throw
shooting, by giving an estimated mean and standard deviation for $\pi_i$.

14.6  Extend  the  various  analyses  of  the  teratology  data  in  Section  14.5  as  follows: 

a.  Include  a  predictor  for  litter  size  (as  well  as  group).  Interpret,  and  compare 
results  to  those  without  this  predictor. 

b.  Fit a model with beta-binomial variance (14.10) in which $\rho$ varies by treatment
group. Use results to motivate a model that allows overdispersion only in the
placebo group. Interpret and compare results to those with common $\rho$ for each
group.

Table 14.8  Data for Exercise 14.7

             Treatment 1         Treatment 2         Treatment 3
             Number              Number              Number
Clutch       Hatched   Total     Hatched   Total     Hatched   Total
1                  0       6           3       6           0       6
2                  0      13           0      13           0      13
3                  0      10           8      10           6       9
4                  0      16          10      16           9      16
5                  0      32          25      28          23      30
6                  0       7           7       7           5       7
7                  0      21          10      20           4      20

Source: Data courtesy of Becca Hale, Zoology Department, University of Florida.


14.7  Table  14.8  reports  the  results  of  a  study  of  fish  hatching  under  three  environments. 
Eggs  from  seven  clutches  were  randomly  assigned  to  three  treatments,  and  the 
response  was  whether  an  egg  hatched  by  day  10.  The  three  treatments  were  (1) 
carbon  dioxide  and  oxygen  removed,  (2)  carbon  dioxide  only  removed,  and  (3) 
neither  removed. 

a.  Let $\pi_{it}$ denote the probability of hatching for an egg from clutch $i$ in treatment
$t$. Assuming independent binomial observations, fit the model

$$\text{logit}(\pi_{it}) = \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3,$$

where $z_t = 1$ for treatment $t$ and 0 otherwise. What does your software report
for $\hat\beta_1$, and what should it be? [Hint: Note that treatment 1 has no successes.]

b.  Analyze  these  data  using  an  approach  that  allows  overdispersion.  Interpret. 
Indicate  whether  evidence  of  overdispersion  occurs  for  treatments  2  and  3. 

14.8  Copy the “Ohio Children Wheeze Status” data at the website
cran.r-project.org/web/packages/geepack/geepack.pdf for the geepack package in R.
Analyze  these  data  using  one  method  from  each  of  Chapters  12,  13,  and  14. 
Compare  results  and  interpret. 

14.9  In  2002  the  General  Social  Survey  asked  “How  many  people  at  your  work  place  are 
close  friends?”  The  756  responses  had  a  mean  of  2.76,  standard  deviation  of  3.65, 
and  a  mode  of  0.  If  you  plan  to  build  a  GLM  using  some  explanatory  variables  for 
this  response,  which  distribution  might  be  sensible?  Why? 

14.10  One  question  in  a  GSS  asked  subjects  how  many  times  they  had  sexual  intercourse 
in  the  preceding  month. 

a.  The sample means were 5.9 for males and 4.3 for females; the sample variances
were 54.8 and 34.4. The mode for each gender was 0. Does an ordinary Poisson
GLM seem appropriate? Explain.

b.  The Poisson GLM with log link and an indicator variable for gender (1 =
males, 0 = females) has gender estimate 0.308 (SE = 0.038). Find the Wald
95% confidence interval for the ratio of means for males and females.

c.  For the negative binomial model, the log likelihood increases by 248.7. The
estimated difference between the log means is also 0.308, but now SE = 0.127.
Find the 95% confidence interval for the ratio of means. Compare to the Poisson
GLM, and interpret. Which do you think is more appropriate? Why?

14.11 For the data in the previous exercise, argue that a possibly more realistic model
assumes for gender i a proportion $\rho_i$ that is necessarily 0 and a proportion $1 - \rho_i$
that has a distribution that is a gamma mixture of Poissons.

14.12  For  the  homicide  data,  reproduce  the  results  in  Table  14.7  for  the  identity  link. 
Explain  why  the  estimated  difference  in  means  is  identical  for  the  two  GLMs  but 
the  SE  values  are  very  different.  Use  the  more  appropriate  one  to  form  a  confidence 
interval  for  the  true  difference  in  means. 

14.13  For  the  horseshoe  crab  satellite  counts  in  Table  4.3,  use  width  as  a  predictor. 

a. Fit a negative binomial model with log link. Interpret. Describe the estimated
variance as a function of $\mu$.

b.  Fit  a  Poisson  GLMM  with  log  link.  Interpret. 

c.  Compare  results  for  the  models,  including  those  in  Section  4.3.2  for  Poisson 
and  negative  binomial  GLMs.  Indicate  your  preferred  model.  Justify. 

14.14  Use  quasi-likelihood  methods  to  analyze  Table  14.6  on  counts  of  murder  victims. 

14.15 Refer to Exercise 4.1. With data at the book’s website, use methods of this chapter
to analyze how the countywide vote for Pat Buchanan in 2000 related to the vote
for Ross Perot in 1996. Note that Palm Beach County is an enormous outlier. Model
with and without that observation and compare results.

Theory  and  Methods 

14.16 When I = 2, for $q \ge 2$ show that we need $T \ge 4$ for the latent class model to be
unsaturated. Then, find the maximum value for q when T = 4, 5. For an $I^2$ table,
show we need $q < I^2/(2I - 1)$.

14.17  Express  the  log  likelihood  for  latent  class  model  (14.1)  in  terms  of  the  model 
parameters.  Derive  likelihood  equations  (Goodman  1974,  Haberman  1979). 

14.18  In  Section  14.2.3,  under  the  null  that  the  ordinary  logistic  regression  model  holds, 
explain  why  it  is  inappropriate  to  treat  the  difference  between  the  deviances  for 
that  model  and  the  mixture  of  two  logistic  regressions  as  a  chi-squared  statistic. 

14.19 Express the numerator of the beta density in terms of $\mu$ and $\theta$. Using this, show
that it is (a) unimodal when $\theta < \min(\mu, 1 - \mu)$, and (b) the uniform density when
$\mu = \theta = 0.5$.

14.20 Suppose $\pi_i = P(Y_{it} = 1) = 1 - P(Y_{it} = 0)$, for $t = 1, \ldots, n_i$, and
$\mathrm{corr}(Y_{it}, Y_{is}) = \rho$ for $t \ne s$. Show that $\mathrm{var}(Y_{it}) = \pi_i(1 - \pi_i)$,
$\mathrm{cov}(Y_{it}, Y_{is}) = \rho\pi_i(1 - \pi_i)$, and

$\mathrm{var}\left(\sum_t Y_{it}\right) = n_i\pi_i(1 - \pi_i)[1 + \rho(n_i - 1)].$


14.21 Show that the beta-binomial distribution (14.9) simplifies to the binomial when (a)
$\theta = 0$, (b) n = 1. Explain why overdispersion cannot occur when n = 1.

14.22  Liang  and  Hanfelt  (1994)  described  a  teratology  study  comparing  control  and 
treatment  groups  in  which  the  ML  estimate  of  the  treatment  effect  in  a  beta- 
binomial  model  differs  by  a  factor  of  2  depending  on  whether  you  assume  the 
same  overdispersion  parameter  for  each  group.  By  contrast,  with  variance  function 
(14.11), the quasi-likelihood estimate of the treatment effect is the same whether
you assume the same or different $\phi$ for the two groups. Explain why, and discuss
whether  this  is  an  advantage  or  disadvantage  of  that  method. 

14.23 For small $\sigma$, show that the logistic-normal model, $\mathrm{logit}(\pi_i) = \alpha + x_i^T\beta + u_i$,
corresponds approximately to a mixture model for which the mixture distribution has
$\mathrm{var}(\pi_i) = [\mu_i(1 - \mu_i)]^2\sigma^2$. [Hint: See Exercise 4.35.]

14.24  Altham  (1978)  introduced  the  discrete  distribution 

$f(y; \pi, \psi) = c(\pi, \psi)\binom{n}{y}\pi^y(1 - \pi)^{n-y}\exp[\psi y(n - y)], \quad y = 0, 1, \ldots, n,$

where $c(\pi, \psi)$ is a normalizing constant. Show that this is in the exponential family.
Show that the binomial occurs when $\psi = 0$. [Altham noted that overdispersion
occurs when $\psi < 0$. Corcoran et al. (2001) and Lindsey and Altham (1998) used
this as the basis of an alternative model to the beta-binomial.]

14.25 Refer to the previous exercise. For n identically distributed but correlated binary
observations $(y_1, y_2, \ldots, y_n)$, a related loglinear model is

log/r(y,,y2.  +  v  ' 


Explain  why  this  is  a  simple  special  case  of  the  quasi-symmetry  model,  and  explain 
how  the  binomial  is  a  special  case. 


14.26 When $y_1, \ldots, y_N$ are independent from a negative binomial distribution (14.13)
with $\gamma$ fixed, show that $\hat{\mu} = \bar{y}$.

14.27 Using E(Y) = E[E(Y|X)] and var(Y) = E[var(Y|X)] + var[E(Y|X)], derive the
mean and variance of the (a) beta-binomial distribution, and (b) negative binomial
distribution.

14.28 Suppose that given u, Y is Poisson with $E(Y \mid u) = u\mu$, where $\mu$ may depend
on predictors. Suppose that u is a positive random variable with E(u) = 1 and
$\mathrm{var}(u) = \tau$. Show that $E(Y) = \mu$ and $\mathrm{var}(Y) = \mu + \tau\mu^2$. Explain how negative
binomial GLMs and Poisson GLMMs with log link can follow as special cases.

14.29 An alternative negative binomial parameterization results from the gamma density
formula,

$f(\lambda; k, \mu) = \frac{k^{k\mu}}{\Gamma(k\mu)}\exp(-k\lambda)\,\lambda^{k\mu - 1}, \quad \lambda \ge 0,$

for which $E(\lambda) = \mu$, $\mathrm{var}(\lambda) = \mu/k$. Show that this gamma mixture of Poissons
yields a negative binomial with

$E(Y) = \mu, \quad \mathrm{var}(Y) = \mu(1 + k)/k.$

For  what  limiting  value  of  k  does  this  reduce  to  the  Poisson?  [See  Lee  and  Nelder 
(1996)  for  ML  model  fitting.  Cameron  and  Trivedi  (1998,  p.  75)  pointed  out  that, 
unlike  with  quadratic  variance,  consistency  does  not  occur  for  the  GLM  parameter 
estimators  when  the  model  for  the  mean  holds  but  the  true  distribution  is  not 
negative  binomial.] 

14.30 Suppose $Y_1$ and $Y_2$ are independent negative binomial variates with common
dispersion parameter $\gamma$. Show that $Y_1 + Y_2$ is negative binomial with dispersion parameter
$\gamma/2$. Show that $Y_1$, conditional on $Y_1 + Y_2$, is beta-binomial. State the multiple-
category extension that yields a Dirichlet-multinomial distribution. Explain the
analogy with the Poisson-multinomial result in Section 1.2.5.

14.31 Show that the loglinear random effects model

$\log[E(Y_{it} \mid u_i)] = x_{it}^T\beta + z_{it}^T u_i,$

where $\{u_i\}$ are independent $N(0, \Sigma)$, implies the marginal loglinear model

$\log[E(Y_{it})] - \frac{1}{2} z_{it}^T\Sigma z_{it} = x_{it}^T\beta,$

with the same fixed effects but with offset term.

14.32  In  Section  14.5.2  and  the  previous  exercise  we  saw  that  for  Poisson  GLMMs,  the 
marginal  effects  are  the  same  as  the  cluster-specific  effects.  This  does  not  imply 
that  ML  estimates  of  effects  are  the  same  for  a  Poisson  GLMM  and  a  Poisson 
GLM.  Explain  why.  [Hint:  For  the  GLMM,  is  the  marginal  distribution  Poisson?] 

14.33 For the Poisson GLMM (14.14), use the normal moment generating function to
show that, for $t \ne s$,

$\mathrm{cov}(Y_{it}, Y_{is}) = \exp[(x_{it} + x_{is})^T\beta]\,[\exp(\sigma^2)(\exp(\sigma^2) - 1)].$

Hence, find $\mathrm{corr}(Y_{it}, Y_{is})$.

14.34  For  a  Poisson  GLMM  using  the  identity  link,  relate  the  marginal  mean  and  variance 
to  the  conditional  mean  and  variance.  Explain  the  structural  problem  that  this  model 
has. 


CHAPTER  15 


Non-Model-Based  Classification 
and  Clustering 


In this book we’ve focused on ways of modeling categorical response data. This chapter
presents  some  alternative  analyses  that  are  not  model-based  or  else  have  a  much  more 
general  model  structure. 

Sections  15.1  and  15.2  deal  with  non-model-based  alternatives  to  logistic  regression 
for  classifying  observations  into  response  categories.  In  Section  15.1  we  introduce  linear 
discriminant  analysis,  a  method  that  is  more  efficient  than  logistic  regression  when  the 
explanatory  variables  have  a  normal  distribution.  In  Section  15.2  we  present  a  method  for 
constructing  a  graphical  tree  for  making  such  predictions.  In  Section  15.3  we  discuss  ways 
of  grouping  sets  of  observations  on  multiple  response  variables  into  clusters. 


15.1  CLASSIFICATION:  LINEAR  DISCRIMINANT  ANALYSIS 

In Section 6.3.3 we used logistic regression to classify binary observations. One rule predicts
that y = 1 whenever the x values are such that the model has an estimate $\hat{\pi}$ of P(y = 1) that
exceeds 0.50. Equivalently, this corresponds to having the linear predictor value (including the
intercept) in the model satisfy $\hat{\beta}^T x > 0$. There are alternative, non-model-based ways of
dividing the set of explanatory variable values into two sets, in one of which the predicted
y = 1 (which we denote by $\hat{y} = 1$) and in the other of which $\hat{y} = 0$.

We’ve seen one such method in Section 7.4.4, using the kernel approach of nearest
neighbor smoothing. The best known non-model-based method, called linear discriminant
analysis, is another simple alternative to logistic regression for binary classification. Recall
that logistic regression makes no assumption about the distribution of X and instead focuses
on the binary distribution of Y given x. By contrast, linear discriminant analysis also makes
an assumption about the distribution of X, given y. As with logistic regression, its
boundary between the two sets of x values with $\hat{y} = 1$ and $\hat{y} = 0$ is linear. Why use it
instead of logistic regression? When its assumption that X is normally distributed, given y,
is reasonable, there is the potential of an efficiency improvement from using the extra
information.

15.1.1 Classification with Normally Distributed Predictors

In Section 5.1.5 (and Exercise 5.30) we noted that normal distributions for (X | Y = j) for
j = 0, 1 imply a logistic regression curve for P(Y = 1 | x). For multiple predictors, suppose
that (X | Y = j) has a multivariate $N(\mu_j, \Sigma)$ distribution, j = 0, 1. Then, by Bayes’ theorem
with $\pi = P(Y = 1)$, it follows that P(Y = 1 | x) satisfies

$\mathrm{logit}[P(Y = 1 \mid x)] = \log\frac{\pi}{1 - \pi} - \frac{1}{2}(\mu_0 + \mu_1)^T\Sigma^{-1}(\mu_1 - \mu_0) + (\mu_1 - \mu_0)^T\Sigma^{-1}x.$

That is, a logistic regression model holds with effect parameters $\beta = (\mu_1 - \mu_0)^T\Sigma^{-1}$. The
effects are stronger when the groups having y = 1 and having y = 0 are farther apart and
when there is less variability within those groups.
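
The logit expression follows in a few lines from Bayes’ theorem. As a brief sketch, write $\phi(x; \mu, \Sigma)$ for the multivariate normal density (a symbol introduced here only for this illustration):

```latex
\begin{align*}
\operatorname{logit}[P(Y = 1 \mid x)]
  &= \log\frac{\pi\,\phi(x; \mu_1, \Sigma)}{(1 - \pi)\,\phi(x; \mu_0, \Sigma)} \\
  &= \log\frac{\pi}{1 - \pi}
     - \tfrac{1}{2}(x - \mu_1)^T\Sigma^{-1}(x - \mu_1)
     + \tfrac{1}{2}(x - \mu_0)^T\Sigma^{-1}(x - \mu_0) \\
  &= \log\frac{\pi}{1 - \pi}
     - \tfrac{1}{2}(\mu_0 + \mu_1)^T\Sigma^{-1}(\mu_1 - \mu_0)
     + (\mu_1 - \mu_0)^T\Sigma^{-1}x .
\end{align*}
```

The quadratic terms $x^T\Sigma^{-1}x$ cancel only because the two groups share the covariance matrix, which is why the logit is linear in x.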

Fisher (1936) developed a related method for using observations on x to classify on
y, before the advent of logistic regression. It assumes a common covariance matrix $\Sigma$ for
X within each category for y, but its motivation does not require normality assumptions.
Fisher’s goal was to find a linear combination $l^T x$ such that its values when y = 1 were
separated as much as possible from its values when y = 0, relative to the variability of $l^T x$
values within each y category. The solution maximizes the squared distance between the
means of $l^T x$ for the two categories of y, divided by the within-category variance of $l^T x$.
Equivalently, for a given value of x, the prediction for y is the category j (j = 0, 1) that has
the minimum of the Mahalanobis distance of x from $\mu_j = E(X \mid Y = j)$, which is

$d_j(x) = (x - \mu_j)^T\Sigma^{-1}(x - \mu_j), \quad j = 0, 1.$

When we have a prior value $\pi_0$ for P(Y = 1), then $2\,\mathrm{logit}(\pi_0)$ is subtracted from the
distance for j = 1 (the factor 2 arises because the log of the normal density involves $-d_j(x)/2$).

In practice, we estimate these distances by substituting the sample means $\bar{x}_1$ and $\bar{x}_0$ and
a pooled covariance estimate S for $\Sigma$. This method yields a linear function as the boundary
between the sets of x values having $\hat{y} = 1$ and $\hat{y} = 0$. At a particular x, $\hat{y} = 1$ if

$(\bar{x}_1 - \bar{x}_0)^T S^{-1}x > (\bar{x}_1 - \bar{x}_0)^T S^{-1}(\bar{x}_1 + \bar{x}_0)/2 - \mathrm{logit}(\pi_0).$

For example, with $\pi_0 = 0.50$ and a single predictor x having $\bar{x}_1 > \bar{x}_0$, we predict that y = 1
if $x > (\bar{x}_1 + \bar{x}_0)/2$, that is, if x is closer to $\bar{x}_1$ than to $\bar{x}_0$.

This prediction rule depends on x only through the left-hand term in this equation.
This term, $(\bar{x}_1 - \bar{x}_0)^T S^{-1}x$, is called Fisher’s linear discriminant function. In fact, the
regression function for ordinary least-squares regression of an indicator variable for y on x
(which is the ML fit of the linear probability model under a normal response assumption) is
proportional to that term. Because of this connection between Fisher’s linear discriminant
function and the regression equation, the observations having $\hat{y} = 1$ are those for which
the linear regression-based estimate of E(Y | x) is sufficiently high.
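
The proportionality between the least-squares slopes and Fisher’s discriminant coefficients can be checked numerically. The following is a minimal sketch on simulated data; the group means, covariance matrix, sample sizes, and variable names are all assumptions chosen for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (hypothetical) data: two groups sharing one covariance matrix.
n0, n1 = 60, 40
cov = np.array([[1.0, 0.4], [0.4, 2.0]])
x0 = rng.multivariate_normal([0.0, 0.0], cov, size=n0)
x1 = rng.multivariate_normal([1.5, -1.0], cov, size=n1)
X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

# Fisher's discriminant coefficients S^{-1}(xbar1 - xbar0), with S pooled.
xbar0, xbar1 = x0.mean(axis=0), x1.mean(axis=0)
S = ((x0 - xbar0).T @ (x0 - xbar0) + (x1 - xbar1).T @ (x1 - xbar1)) / (n0 + n1 - 2)
fisher_coef = np.linalg.solve(S, xbar1 - xbar0)

# Least-squares slopes from regressing the 0/1 indicator on x (with intercept).
design = np.column_stack([np.ones(len(y)), X])
ols_coef = np.linalg.lstsq(design, y, rcond=None)[0][1:]

# The ratio is the same constant for every predictor.
ratio = fisher_coef / ols_coef
print(ratio)
```

The common ratio plays the role of the divisor that links the two sets of coefficients; only the intercept and cutoff differ between the two formulations.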

Consider now the additional assumption that the distribution of X in each y category
is multivariate normal with common covariance matrix. Then, Bayes’ theorem with a
particular prior value $\pi_0 = P(Y = 1)$ provides proper posterior probability estimates for each
category. With $\pi_0 = 0.50$ and the estimated Mahalanobis distance values, at a particular
value of x,

$\hat{P}(Y = 1 \mid x) = \frac{\exp[-\frac{1}{2}\hat{d}_1(x)]}{\exp[-\frac{1}{2}\hat{d}_0(x)] + \exp[-\frac{1}{2}\hat{d}_1(x)]}.$

Section 6.3.3 presented the classification table as a way of summarizing predictions
made using a fitted logistic regression model. This type of table can describe the quality of
predictions with any method of classification. The true misclassification probabilities tend to
be underestimated by predicting observations using the equation to which those observations
contributed, so cross-validation can be employed to obtain less biased estimates.
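
The pieces of this subsection (sample means, pooled covariance, estimated distances, posterior probability, and leave-one-out cross-validation) can be put together in a few lines. This is a sketch under stated assumptions rather than a reproduction of any particular software: the function names `fit_lda`, `posterior_y1`, and `loo_error_rate` are hypothetical, and the data are simulated.

```python
import numpy as np

def fit_lda(X, y):
    """Return the two sample mean vectors and the pooled covariance estimate S."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return m0, m1, W / (len(y) - 2)

def posterior_y1(x, m0, m1, S, prior1=0.5):
    """Estimated P(Y = 1 | x) under normality with a common covariance matrix."""
    S_inv = np.linalg.inv(S)
    d0 = (x - m0) @ S_inv @ (x - m0)   # estimated distance to group 0
    d1 = (x - m1) @ S_inv @ (x - m1)   # estimated distance to group 1
    log_odds = np.log(prior1 / (1 - prior1)) - 0.5 * (d1 - d0)
    return 1.0 / (1.0 + np.exp(-log_odds))

def loo_error_rate(X, y, prior1=0.5):
    """Leave-one-out estimate of the misclassification rate."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        m0, m1, S = fit_lda(X[keep], y[keep])
        y_hat = int(posterior_y1(X[i], m0, m1, S, prior1) > 0.5)
        errors += int(y_hat != y[i])
    return errors / len(y)

# Simulated (hypothetical) data with two predictors.
rng = np.random.default_rng(1)
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 50),
               rng.multivariate_normal([2, 1], cov, 50)])
y = np.repeat([0, 1], 50)

m0, m1, S = fit_lda(X, y)
p_mid = posterior_y1((m0 + m1) / 2, m0, m1, S)   # midpoint between the means
loo = loo_error_rate(X, y)
print(p_mid, loo)
```

With equal priors, the midpoint between the two sample means is equidistant from both, so its estimated posterior probability is 0.50, in line with the single-predictor example above.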


15.1.2  Example:  Horseshoe  Crab  Satellites  Revisited 

In Section 6.3.3 we illustrated classification tables for logistic regression using the horseshoe
crab data set, for the model using a female crab’s width and color as predictors of whether
that crab has at least one male satellite. To illustrate linear discriminant analysis, as in model
(5.14) in Section 5.4.6 we’ll use the quantitative scoring (1, 2, 3, 4) for the color levels rather
than treating color as a factor, so that the normality assumption for the joint distribution
of x = width and c = color is not so badly violated. For $\pi_0 = 0.50$, software (SAS PROC
DISCRIM) reports the linear discriminant function 0.430x - 0.553c, with $\hat{y} = 1$ when
0.430x - 0.553c > 9.811.

The least-squares fit of the linear probability model for y is $\hat{\pi} = -1.234 + 0.0811x -
0.1024c$. The coefficients of x and c are identical to those from the linear discriminant
function divided by 5.30. The inequality for predicting y based on the linear discriminant
function (i.e., $\hat{y} = 1$ when 0.430x - 0.553c > 9.811) is equivalent to -1.851 + 0.0811x -
0.1024c > 0, or $\hat{\pi} > 0.614$ for the fit of the linear probability model. (The inequality is
equivalent to $\hat{\pi} > 0.50$ only when the sample proportion of y = 1 is 0.50.)

Figure 15.1 shows the data and the classification regions obtained with linear
discriminant analysis. The boundary line is c = -18.08 + 0.79x. The figure also shows the
boundary line from using logistic regression with a cutoff of $\hat{\pi} > 0.614$ for $\hat{y} = 1$, for
which c = -20.70 + 0.90x. In practical terms, the regions are very similar.

Figure 15.1 Classification regions (solid line for linear discriminant analysis, dotted line for logistic regression)
with width and color predictors of presence of horseshoe crab satellites.

Table 15.1 Classification Tables for Predictions Using Discriminant
Analysis and Logistic Regression for Horseshoe Crab Data

            Discriminant Analysis            Logistic Regression
Actual      $\hat{y} = 1$   $\hat{y} = 0$    $\hat{y} = 1$   $\hat{y} = 0$    Total
y = 1            72              39               77              34           111
y = 0            19              43               21              41            62

Table 15.1 shows the classification table that results using cross-validation, in which
to predict observation i, we use the linear discriminant function obtained with the other
n - 1 observations. Table 15.1 also shows a classification table based on logistic regression
modeling, with cross-validation. To enhance comparability of the two approaches, we used
0.614 as the boundary for $\hat{\pi}$ for the predictions.
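
To compare the two halves of Table 15.1 on a common scale, the counts can be converted to the usual summary proportions. The sketch below uses only the counts from the table; the dictionary keys and the `summarize` helper are names made up for this illustration.

```python
# Counts copied from Table 15.1: "y1_pred1" means actual y = 1, predicted 1, etc.
tables = {
    "discriminant analysis": {"y1_pred1": 72, "y1_pred0": 39, "y0_pred1": 19, "y0_pred0": 43},
    "logistic regression": {"y1_pred1": 77, "y1_pred0": 34, "y0_pred1": 21, "y0_pred0": 41},
}

def summarize(t):
    """Sensitivity, specificity, and overall proportion correct for one table."""
    n = sum(t.values())
    return {
        "sensitivity": t["y1_pred1"] / (t["y1_pred1"] + t["y1_pred0"]),
        "specificity": t["y0_pred0"] / (t["y0_pred0"] + t["y0_pred1"]),
        "proportion_correct": (t["y1_pred1"] + t["y0_pred0"]) / n,
    }

summaries = {name: summarize(t) for name, t in tables.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Both methods classify about two-thirds of the 173 crabs correctly (115 and 118, respectively), which agrees with the remark that the two classification regions are very similar in practical terms.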

15.1.3  Multicategory  Classification  and  Other  Versions  of  Discriminant  Analysis 

In discriminant analysis, the linear separating boundary in the space of x values between
$\hat{y} = 1$ and $\hat{y} = 0$ can be generalized. If we include quadratic and cross-product interaction
terms in x, the boundary becomes quadratic in the space of the original x variables
(Anderson 1975). Interestingly, the normal assumption for the distribution of (X | Y = j) but
with unequal covariance matrices implies a logistic regression model for P(Y = 1) that is
quadratic in x. For other generalizations, see Note 15.2.

In the other direction (i.e., simplicity), a diagonal discriminant analysis simplification
treats the common covariance matrices for (X | Y = 0) and (X | Y = 1) as being diagonal.
This  seems  like  an  assumption  that  could  be  badly  violated  for  most  applications;  however, 
as  discussed  in  Section  15.1.4,  when  the  number  of  predictors  is  very  large,  it  can  result  in 
better  classification  performance  than  linear  or  quadratic  discriminant  analysis. 

Linear discriminant analysis extends directly to multicategory classification. When
Y = j, denote the pdf of X by $g_j(x)$, and let $\pi_j = P(Y = j)$, j = 1, 2, ..., J. By Bayes’
theorem,

$P(Y = j \mid X = x) = \frac{\pi_j g_j(x)}{\sum_{h=1}^{J}\pi_h g_h(x)}.$

If we assume a particular parametric family for $\{g_j\}$, we can use data to estimate the
densities and hence estimate classification probabilities for Y.

The most common way to do this assumes that (X | Y = j) has a multivariate $N(\mu_j, \Sigma)$
distribution, j = 1, 2, ..., J, with the same covariance matrix for each group. It can then
be shown (Warner 1963) that

$\log\frac{P(Y = j \mid x)}{P(Y = J \mid x)} = \log\frac{\pi_j}{\pi_J} - \frac{1}{2}(\mu_j + \mu_J)^T\Sigma^{-1}(\mu_j - \mu_J) + (\mu_j - \mu_J)^T\Sigma^{-1}x.$

That is, a baseline-category logit model holds with effect parameters $\beta_j = (\mu_j - \mu_J)^T\Sigma^{-1}$.
After estimating the multivariate normal parameters by the sample means $\{\bar{x}_j\}$ and a
pooled covariance estimate S, the method predicts that $\hat{y} = j$ if the linear discriminant
function

$\bar{x}_h^T S^{-1}x - \frac{1}{2}\bar{x}_h^T S^{-1}\bar{x}_h + \log\pi_h$

takes maximum value for h = j.

As  in  the  binary  case,  if  the  assumption  were  truly  satisfied  about  X  having  a  normal 
conditional  distribution  with  common  covariance,  this  classification  method  would  be 
optimal.  To  avoid  such  a  strong  assumption,  we  can  instead  use  direct  ML  fitting  with  the 
baseline-category  logit  model. 
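
The multicategory rule amounts to computing one linear score per category and choosing the largest. A minimal sketch follows; the function names are hypothetical, the data are simulated, and the prior probabilities are estimated by sample proportions, which is an assumption of this illustration.

```python
import numpy as np

def fit_multiclass_lda(X, y):
    """Sample means by category, pooled covariance S, and sample proportions."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    W = sum((X[y == k] - m).T @ (X[y == k] - m) for k, m in zip(classes, means))
    S = W / (len(y) - len(classes))
    priors = np.array([np.mean(y == k) for k in classes])
    return classes, means, S, priors

def discriminant_scores(x, means, S, priors):
    """xbar_h' S^{-1} x - 0.5 * xbar_h' S^{-1} xbar_h + log(pi_h), for each h."""
    A = np.linalg.solve(S, means.T).T          # row h holds S^{-1} xbar_h
    return A @ x - 0.5 * np.sum(A * means, axis=1) + np.log(priors)

def predict(x, classes, means, S, priors):
    """Predicted category: the one with the largest discriminant score."""
    return classes[np.argmax(discriminant_scores(x, means, S, priors))]

# Simulated (hypothetical) data: three categories, equal group sizes.
rng = np.random.default_rng(3)
centers = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 2)) for c in centers])
y = np.repeat([1, 2, 3], 40)

classes, means, S, priors = fit_multiclass_lda(X, y)
print(predict(np.array([2.8, 0.2]), classes, means, S, priors))
```

With equal prior proportions, maximizing this score is the same as choosing the category whose sample mean is closest to x in estimated Mahalanobis distance, so each sample mean is classified into its own category.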

15.1.4  Classification  Methods  for  High  Dimensions 

In classification problems with a large number p of explanatory variables, Bickel and Levina
(2004) found that prediction rules that treat those explanatory variables as independent often
outperform rules that
estimate  dependences  among  them  in  constructing  the  classifier.  This  reflects  the  difficulty 
in  estimating  well  the  covariance  matrix  when  p  is  very  large.  They  showed  that  a  naive 
Bayes  rule  that  assumes  independence,  such  as  in  diagonal  discriminant  analysis,  can  greatly 
outperform  ordinary  linear  discriminant  analysis. 

An assumption of independence seems stringent and grossly invalid for many applications.
However, with very large p, we often expect any particular predictor to be very
weakly  correlated  with  most  of  the  other  predictors,  and  a  true  correlation  value  is  typically 
closer  to  0  than  to  the  ML  estimate  based  on  an  enormous  correlation  matrix.  For  example, 
Dudoit  et  al.  (2002)  found  that  this  method  performs  better  than  ordinary  linear  discriminant 
analysis  for  classifying  tumors  using  gene  expression  data.  However,  Fan  and  Fan  (2008) 
noted  that  even  for  the  independence  classification  rule,  performance  can  be  poor  because 
of  the  accumulation  of  noise  unless  there  is  some  variable  reduction. 
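
One concrete reason that ordinary linear discriminant analysis struggles when p is large is that the pooled covariance estimate cannot be inverted once p exceeds n - 2. The sketch below illustrates this on simulated data and shows the diagonal rule, which needs only the pooled variances. The sample sizes, the shift in the first ten predictors, and the helper name `predict_diagonal` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_group, p = 20, 200                 # far more predictors than observations
shift = np.zeros(p)
shift[:10] = 1.0                         # only the first 10 predictors differ by group
X = np.vstack([rng.normal(size=(n_per_group, p)),
               rng.normal(size=(n_per_group, p)) + shift])
y = np.repeat([0, 1], n_per_group)

X0, X1 = X[y == 0], X[y == 1]
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
resid = np.vstack([X0 - m0, X1 - m1])
S = resid.T @ resid / (len(y) - 2)       # pooled covariance estimate, p x p

# In exact arithmetic the rank of S is at most n - 2, so S is singular here.
rank_S = np.linalg.matrix_rank(S)

# Diagonal discriminant analysis keeps only the pooled variances.
pooled_var = np.diag(S)

def predict_diagonal(x, prior1=0.5):
    """Classify x using distances that ignore correlations among predictors."""
    d0 = np.sum((x - m0) ** 2 / pooled_var)
    d1 = np.sum((x - m1) ** 2 / pooled_var)
    return int(np.log(prior1 / (1 - prior1)) - 0.5 * (d1 - d0) > 0)

print(rank_S, p, predict_diagonal(m1), predict_diagonal(m0))
```

The diagonal rule still sums a small contribution from each of the many uninformative predictors, which is the accumulation of noise noted by Fan and Fan (2008) and the motivation for some variable reduction.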

For  classification,  another  simple  alternative  to  logistic  regression  and  linear  discriminant 
analysis  is  the  nearest  neighbors  method  (Section  7.4.2).  It  classifies  an  observation  based 
on  an  estimated  probability  obtained  by  averaging  response  values  for  nearby  observations. 
A challenge with this method is that with a very large p, the “curse of dimensionality”
occurs  and  a  subject  may  have  few  or  no  close  neighbors. 
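
A small simulation shows both the nearest neighbors estimate and the way neighbors drift apart as p grows. This is only a sketch: the choice k = 5, Euclidean distance, uniformly distributed predictors, and the function names are all assumptions made for the illustration.

```python
import numpy as np

def knn_probability(x, X, y, k=5):
    """Estimate P(Y = 1 | x) by averaging y over the k nearest observations."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return y[nearest].mean()

def mean_nearest_distance(n, p, rng):
    """Average distance from each of n uniform points in [0, 1]^p to its nearest neighbor."""
    X = rng.uniform(size=(n, p))
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return d.min(axis=1).mean()

rng = np.random.default_rng(4)

# Nearest neighbors estimate at one point, with two simulated predictors.
X = rng.uniform(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(float)
p_hat = knn_probability(np.array([0.9, 0.9]), X, y)

# Typical distance to the nearest neighbor for the same n but growing p.
nearest_dist = {p: mean_nearest_distance(200, p, rng) for p in (2, 10, 100)}
print(p_hat, nearest_dist)
```

With n fixed, the typical distance to the nearest neighbor grows with p, so the observations being averaged are no longer local to the point being classified.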

A  more  complex  method  used  with  large  p  is  support  vector  machines,  presented  in 
Section  15.2.6.  Zhang  et  al.  (2006)  used  this  approach  together  with  a  SCAD-type  penalty 
for the application of identifying important genes for cancer classification. But as
discussed in Section 15.2.6, there is no guarantee that more complex methods will have better
performance. 

15.1.5  Discriminant  Analysis  Versus  Logistic  Regression 

When the explanatory variables truly have a normal distribution, conditional on y,
discriminant analysis is optimal for classifying observations. In particular, it is more efficient
than  logistic  regression,  potentially  considerably  more  as  the  groups  become  more  widely 
separated  (Efron  1975).  This  is  because  it  utilizes  the  information  about  the  distribution  of 
X ,  which  logistic  regression  ignores. 

Often,  however,  explanatory  variables  can  be  far  from  normally  distributed,  such  as 
when  at  least  one  explanatory  variable  is  qualitative.  Also,  extreme  outliers  on  x  can  have 
a  large  effect  on  discriminant  analysis  (as  in  ordinary  linear  regression)  but  little  impact  on 
logistic regression. So, logistic regression is more robust and has broader scope, as it makes
no  assumption  about  a  distribution  for  X  and  merely  assumes  a  binomial  distribution  for  Y 
at  each  value  of  x.  Also,  logistic  regression  has  the  advantage  over  discriminant  analysis 
of  providing  direct  ways  of  summarizing  effects  of  explanatory  variables,  through  odds 
ratios.  See  Note  15.1  and  Section  8.5.2  of  McLachlan  (2004)  and  references  therein  for 
more  discussion  of  the  relative  merits  of  the  two  approaches. 


15.2  CLASSIFICATION:  TREE-STRUCTURED  PREDICTION 

In recent years non-model-based methods have been further developed for predicting
response variables using data on a set of explanatory variables. These are examples of methods
often  referred  to  with  the  terms  machine  learning  and  data  mining.  Rather  than  relying 
on  a  model  to  summarize  effects  of  explanatory  variables  on  the  response  variable,  such 
methods  are  algorithm-driven.  Using  various  criteria,  they  provide  a  way  of  “learning” 
from  the  available  information  on  all  the  variables  to  estimate  the  unknown  relationship 
between  E(Y)  and  the  explanatory  variables  x.  This  results  in  an  algorithm  for  making 
future  predictions  of  y  based  solely  on  values  of  x.  The  effectiveness  of  the  algorithm  is 
evaluated  by  its  error  rate  for  future  samples. 

Even  with  a  model,  when  n  is  extremely  large,  significance  tests  are  less  relevant,  as 
statistical  significance  does  not  imply  practical  significance.  Inference  may  not  even  be 
relevant because of nonprobability sampling. Some strategies may be useful for
prediction of response outcomes even if they have complex structure and do not correspond to
understandable  models. 

A  detailed  presentation  of  these  algorithmic  methods  is  beyond  the  scope  of  this  book.  In 
this  section,  we  describe  a  particular  method  for  binary  responses  that  provides  a  simple  tree- 
structured  depiction  of  how  predictions  can  be  made.  Compared  with  discriminant  analysis, 
this  classification  method  is  less  restrictive  in  distributional  assumptions  and  in  the  form 
for  the  predictor  decision  boundary.  However,  the  methods  yield  decision  boundaries  that 
are  highly  nonlinear  in  the  space  of  x  values. 

15.2.1  Classification  Trees 

The  classification  tree  method  formalizes  a  decision  process  that  uses  a  sequential  set  of 
questions  about  the  x  values  to  yield  a  classification  prediction  for  y.  A  created  graphical 
tree  summarizes  binary  splits  on  variables  at  various  stages  to  determine  the  prediction. 
This  method,  proposed  by  Breiman  et  al.  (1984)  and  extending  earlier  work  such  as  by 
Kass  (1980),  utilizes  classification  tables  in  the  process  of  forming  the  tree.  Its  set  of  x 
values  for  which  y  —  1  has  simple  form,  consisting  of  a  set  of  rectangular  regions. 

For  example,  consider  the  prediction  of  a  person’s  vote  for  the  Democrat  or  Republican 
candidate  in  a  U.S.  presidential  election.  Two  regions  that  yield  a  prediction  of  voting  for 
the  Republican  candidate  might  be  (1)  everyone  who  is  male  and  attends  religious  services 
at  least  once  a  week  and  has  annual  income  over  $50,000,  and  (2)  everyone  who  is  female 
and  who  opposes  legalized  abortion  and  is  married  and  never  been  divorced.  A  common 
application  of  classification  trees  is  making  a  prediction  about  whether  a  patient  has  a 
particular  medical  condition.  Zhang  and  Singer  (2010)  described  an  early  application  that 
used responses to 13 questions based on results of physiological tests and various patient
characteristics (such as age and medical history) to predict whether a patient arriving at an
emergency room complaining of chest pain has had a heart attack. Breiman et al. (1984,
Chap.  6)  showed  a  similar  sort  of  application,  classifying  the  prognosis  of  heart  attack 
victims  as  survivors  or  as  early  deaths. 


15.2.2  Example:  Classification  Tree  for  a  Health  Care  Application 

We  use  an  example  to  illustrate  the  components  of  the  classification  tree  method.  Noe 
et  al.  (2009)  predicted  whether,  over  a  one-year  period,  elderly  subjects  participating  in  an 
assisted-living  program  disenroll  from  the  program  to  enter  a  nursing  home.  The  sample 
consisted  of  4654  individuals  who  had  been  enrolled  in  the  program  for  at  least  a  year  and 
who  did  not  die  during  that  one-year  period.  Of  this  sample,  325  (7%)  disenrolled  from  the 
program  during  the  year. 

Figure  15.2  shows  the  classification  tree.  It  summarizes  responses  to  four  questions  with 
binary  outcomes,  listed  next  together  with  the  counts  having  each  response  at  a  particular 
branch  of  the  tree: 

Q1: Is the subject’s age > 70? (3157 yes, 1497 no)

Q2:  Is  the  subject’s  age  >  83?  (931  yes,  2226  no) 

Q3:  Does  the  subject  have  dementia?  (65  yes,  2161  no) 

Q4:  Does  the  subject  have  Parkinson’s  disease?  (37  yes,  2124  no) 

Figure 15.2 Classification tree for predicting disenrollment from an assisted-living program. Source: Figure 2
in Noe et al. (2009). Reprinted with permission.

As the tree indicates, those predicted to disenroll from the program were those of age >83,
those of age >70 and <83 having dementia, and those of age >70 and <83 not having
dementia but having Parkinson’s disease. In summary, those of age >83 are predicted to
disenroll, those of age <70 are predicted to remain in the program, and those of age between
70 and 83 are predicted to disenroll if they have dementia or Parkinson’s disease.
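
The prediction rule read off the tree can be written as a short function, and the counts listed with Q1 to Q4 can be checked for consistency. This is a sketch: the function name is hypothetical, and the handling of ages exactly equal to 70 or 83 is an assumption, since the text states only strict inequalities.

```python
def predict_disenroll(age, dementia, parkinsons):
    """Prediction rule summarized from Figure 15.2 (True = predicted to disenroll)."""
    if age > 83:                      # Q2: oldest group
        return True
    if age > 70:                      # Q1 yes, Q2 no: check Q3 and then Q4
        return dementia or parkinsons
    return False                      # Q1 no: predicted to remain

# Counts reported with the four questions.
n_total = 3157 + 1497                 # Q1 split of the full sample
predicted_disenroll = 931 + 65 + 37   # "yes" branches of Q2, Q3, Q4
predicted_remain = 1497 + 2124        # "no" branch of Q1 and "no" branch of Q4

print(n_total, predicted_disenroll, predicted_remain)
print(predict_disenroll(75, dementia=False, parkinsons=True))
```

The five terminal-node counts (1497, 931, 65, 37, and 2124) add to the full sample of 4654, and each "yes" count at Q1 through Q3 equals the sum of the two counts at the next question.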

The  points  on  the  classification  tree  at  which  binary  splits  occur  are  called  nodes.  The 
initial  node  containing  all  the  observations  is  the  root  node.  The  nodes  beyond  which  no 
further  splits  occur,  shown  by  boxes  instead  of  circles  in  Figure  15.2,  are  called  terminal 
nodes.  Figure  15.2  has  9  nodes,  of  which  5  are  terminal.  The  terminal  nodes  partition  the 
entire  sample  into  disjoint  subsets. 


15.2.3  How  Does  the  Classification  Tree  Grow? 

The method for constructing a binary classification tree uses a recursive partitioning
algorithm for determining (1) how to choose the splitting variable at each node, (2) how to split
a  node  on  a  chosen  variable,  and  (3)  how  to  declare  a  node  to  be  terminal.  Without  going 
into  detail,  we  now  outline  the  main  ideas.  First,  binary  splits  are  used  instead  of  multiway 
splits  so  the  data  do  not  get  too  fragmented  too  quickly.  In  any  case,  multiway  splits  can 
result  from  a  series  of  binary  splits,  such  as  Figure  15.2  does  with  age. 

Figure  15.2  predicts  that  931  +  65  +  37  =  1033  of  the  4654  subjects  disenroll.  In 
reality,  only  325  actually  did  disenroll.  In  terms  of  frequency  of  misclassification,  the  naive 
rule  that  predicts  that  no  one  would  disenroll  does  better.  The  classification  tree  does  not 
use  this  naive  rule  because  the  two  types  of  misclassifications  were  treated  differently 
in  constructing  the  tree.  Noe  et  al.  (2009)  focused  on  identifying  subjects  who  would 
disenroll.  They  assigned  a  cost  1 3  times  as  high  to  predicting  that  someone  would  remain 
in  the  program  who  actually  left  it  than  to  predicting  that  someone  would  leave  the  program 
who  actually  stayed.  The  relative  misclassification  cost  for  each  possible  prediction  is 
a primary factor in determining the splits. Because of these differing costs, the method
produces  three  terminal  nodes  identifying  subjects  as  susceptible  to  disenroll,  although  the 
actual  percentages  of  people  who  disenrolled  in  these  nodes  are  all  no  greater  than  18.9%. 
If  instead  the  costs  of  the  two  types  of  misclassification  were  equal,  at  each  terminal  node 
the  prediction  would  merely  be  the  one  with  the  smallest  number  of  misclassifications. 

The  tree-structured  classification  method  begins  at  the  root  node  with  all  the  sample 
subjects  and  first  selects  the  best  binary  predictor  of  the  response  variable.  In  Figure  15.2, 
age, which is continuous, is split into <70 and >70. This produces two new nodes, each of
which is a candidate for further binary splitting. For an ordinal variable or a quantitative
variable  such  as  age,  the  split  takes  the  form  of  values  falling  above  versus  below  a  particular 
level.  For  a  nominal  variable,  the  split  is  based  on  ordering  the  categories  by  the  sample 
proportions  falling  in  the  response  category  of  interest  and  then  using  the  same  criterion  to 
select  a  cutpoint  to  separate  them  into  two  sets  of  categories. 
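
As a minimal sketch of the ordering step for a nominal predictor, the following Python fragment lists the candidate binary splits. The function name and the toy data are hypothetical, and breaking ties in the sample proportions alphabetically is an assumption rather than part of the method described above.

```python
from collections import defaultdict

def ordered_category_splits(categories, responses):
    """Order categories by the sample proportion with y = 1, then return the
    binary splits obtained by cutting that ordering at each position."""
    counts = defaultdict(lambda: [0, 0])  # category -> [number with y = 1, total]
    for cat, y in zip(categories, responses):
        counts[cat][0] += y
        counts[cat][1] += 1
    # Ties in the proportion are broken by category label (an assumption).
    ordered = sorted(counts, key=lambda c: (counts[c][0] / counts[c][1], str(c)))
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

# Hypothetical nominal predictor with three categories.
example_splits = ordered_category_splits(
    ["a", "a", "b", "b", "c", "c"],
    [1, 1, 0, 0, 1, 0],
)
```

A predictor with K observed categories then yields K - 1 candidate splits, each of which can be evaluated with the same criterion used for an ordinal or quantitative predictor.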

To  find  the  first  binary  split,  the  algorithm  forms  a  classification  table  of  the  form  of 
Table  15.2  for  each  possible  binary  split  for  each  predictor  variable.  Ideally,  two  nodes 
would  provide  perfect  prediction,  with  all  observations  in  one  row  of  the  table  falling  in 
one  column  and  all  observations  in  the  other  row  falling  in  the  other  column.  The  optimal 
split  comes  closest  to  this,  in  the  sense  of  maximizing  the  difference  between  the  deviance 
(based  on  a  binomial  likelihood  function)  for  the  model  with  a  common  probability  for  all 
observations  and  the  model  allowing  two  disjoint  regions  of  x  values  each  having  a  common 
probability.  For  a  typical  algorithm,  this  corresponds  to  using  a  statistical  test,  selecting 
the  split  that  yields  the  smallest  P-value  in  a  test  of  the  hypothesis  that  the  created  binary 
variable has no effect on the response. The significance can be judged after making some
Bonferroni-type adjustment (Loh and Shih 1997, Sec. 7.5.2). The same procedure is then
used with each new node.

Table 15.2  A Classification Table for a Predictor Split with the Disenrollment Response

                        Response Outcome
Predictor x             y = 1 (Disenroll)    y = 0 (Remain)
Left node  x < c        n11                  n12
Right node x > c        n21                  n22
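
The deviance comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm of any particular software: the function names are hypothetical, the counts follow the layout of Table 15.2, the candidate tables are made-up numbers, and no multiplicity adjustment is applied.

```python
import math

def binomial_deviance(n_disenroll, n_remain):
    """-2 times the maximized binomial log likelihood for a node in which
    all observations share a common probability."""
    total = n_disenroll + n_remain
    deviance = 0.0
    for count in (n_disenroll, n_remain):
        if count > 0:  # a zero count contributes nothing (0 * log 0 = 0)
            deviance -= 2.0 * count * math.log(count / total)
    return deviance

def deviance_reduction(n11, n12, n21, n22):
    """Deviance of the unsplit node minus the summed deviance of the two
    nodes created by the split."""
    parent = binomial_deviance(n11 + n21, n12 + n22)
    children = binomial_deviance(n11, n12) + binomial_deviance(n21, n22)
    return parent - children

# Hypothetical candidate splits, each given as (n11, n12, n21, n22).
candidates = [(10, 40, 20, 30), (5, 45, 25, 25), (15, 35, 15, 35)]
best_split = max(candidates, key=lambda table: deviance_reduction(*table))
```

A split whose two rows have identical sample proportions gives a reduction of 0, and a split that predicts perfectly recovers the entire deviance of the unsplit node.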

The  tree  can  continue  growing  until  there  are  as  many  nodes  as  distinct  sets  of  values 
of  the  predictors.  In  practice,  this  is  overfitting,  and  a  stopping  rule  is  employed,  such  as 
stopping  when  any  new  node  would  have  fewer  than  some  fixed  number  of  observations. 

For  a  terminal  node,  the  prediction  taken  is  the  response  category  that  has  the  lowest 
misclassification cost. For example, consider the node of 931 subjects of age > 83, of whom
112 disenrolled and 819 stayed. We treat the cost as 1 for misclassifying someone to disenroll
who  actually  stays  and  13  for  misclassifying  someone  to  stay  who  actually  disenrolls.  The 
misclassification  cost  is  then  819  if  we  predict  that  these  931  subjects  disenroll  and  it  is 
13(112)  =  1456  if  we  predict  that  these  931  subjects  stay.  The  misclassification  cost  is 
lower  if  we  predict  that  they  all  disenroll,  so  this  is  the  prediction  for  this  terminal  node. 
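
The arithmetic for this terminal node can be checked with a short Python sketch. The function and argument names are hypothetical, and resolving a tie in favor of predicting disenrollment is an assumption.

```python
def node_prediction(n_disenroll, n_remain,
                    cost_missed_disenroll=13.0, cost_false_disenroll=1.0):
    """Return the prediction with the lower total misclassification cost
    for a terminal node, together with that cost."""
    # Predicting "disenroll" misclassifies everyone who actually remains.
    cost_if_disenroll = cost_false_disenroll * n_remain
    # Predicting "remain" misclassifies everyone who actually disenrolls.
    cost_if_remain = cost_missed_disenroll * n_disenroll
    if cost_if_disenroll <= cost_if_remain:  # tie goes to "disenroll" (assumption)
        return "disenroll", cost_if_disenroll
    return "remain", cost_if_remain

# Node of 931 subjects of age > 83: 112 disenrolled, 819 stayed.
prediction, cost = node_prediction(112, 819)

# With equal costs the prediction is simply the majority outcome at the node.
equal_cost_prediction, equal_cost = node_prediction(112, 819, 1.0, 1.0)
```

With costs of 13 and 1 the node is labeled "disenroll" at a cost of 819 (versus 1456 for "remain"); with equal costs it would be labeled "remain".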

15.2.4  Pruning  a  Tree  and  Checking  Prediction  Accuracy 

For  a  classification  tree  to  perform  better  for  future  prediction  and  not  be  overfitted  to 
the  data,  some  branches  of  the  tree  produced  by  the  basic  algorithm  can  be  eliminated. 
This  process  is  called  pruning.  One  way  to  prune  employs  a  measure  of  the  quality  of  a 
tree  that  is  an  average  of  the  quality  of  the  terminal  nodes,  weighted  by  the  proportion  of 
observations  at  each  such  node.  Let  p(t)  denote  the  proportion  of  observations  that  occur 
at terminal node t. Let c(t) denote the average misclassification cost at that node, which
is the total misclassification cost for the predictions made at that terminal node divided
by the number of subjects at that node. For example, for the 931 disenroll predictions at
age >83, which account for the fraction p(t) = 931/4654 = 0.20 of the sample, we have
c(t) = 819/931 = 0.88. Ideally we want a relatively simple tree that has good predictive
accuracy. Thus, we could use a criterion corresponding to minimizing a measure such as

\[ \sum_{t} p(t)\,c(t) + [\lambda \times (\text{number of terminal nodes})], \]

where the sum is taken over the terminal nodes and λ is a smoothing parameter.
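
To make the penalized criterion concrete, here is a small Python sketch. The function name is hypothetical, and the two competing trees are made-up illustrations; only the node with 931 subjects and total cost 819 comes from the example above.

```python
def penalized_tree_cost(terminal_nodes, n_total, smoothing):
    """Weighted average misclassification cost plus a penalty on tree size.

    terminal_nodes: list of (n_t, total_cost_t) pairs, one per terminal node.
    smoothing: the smoothing parameter multiplying the number of terminal nodes.
    """
    weighted_cost = sum((n_t / n_total) * (cost_t / n_t)  # p(t) * c(t)
                        for n_t, cost_t in terminal_nodes)
    return weighted_cost + smoothing * len(terminal_nodes)

# Contribution of the age > 83 node alone: p(t) c(t) = (931/4654)(819/931).
age_over_83_term = penalized_tree_cost([(931, 819.0)], 4654, 0.0)

# Two hypothetical trees fitted to the same 100 observations.
big_tree = [(30, 5.0), (20, 2.0), (25, 4.0), (25, 3.0)]
small_tree = [(50, 10.0), (50, 9.0)]

for smoothing in (0.01, 0.05):
    big = penalized_tree_cost(big_tree, 100, smoothing)
    small = penalized_tree_cost(small_tree, 100, smoothing)
    print(smoothing, round(big, 3), round(small, 3))
```

In this made-up comparison the larger tree has the smaller criterion value when the smoothing parameter is 0.01, and the smaller tree wins when it is 0.05, which illustrates how the penalty trades fit against complexity.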

The choice of λ reflects the bias/variance trade-off between fitting the data well (many
terminal nodes, low bias) and having a parsimonious tree (relatively few terminal nodes,
low variance). With λ = 0 we get the most complex possible tree. Generally, with very
small λ, the data may be overfitted. As λ increases, more pruning occurs and the tree gets
simpler. For λ_1 < λ_2, a tree with smoothing parameter λ_2 is nested within the tree with
parameter λ_1. Intervals of λ values result in the same tree. In practice, λ is chosen in an
adaptive manner. Ideally, trees for different λ are tested on a separate validation sample
to estimate their predictive accuracies. Several trees may have weighted misclassification
cost for this validation sample that is near the minimum such cost. A tree is chosen that is
relatively simple but has weighted misclassification cost close to the minimum. If a separate
sample is not available, then a cross-validation method can use part of the original sample
to suggest possible trees and the rest of the sample to test them.

Table 15.3  Classification Accuracy for Tree in Figure 15.2

Classification        Observed Response Outcome
Prediction            Disenroll      Remain
Disenroll             130            903
Remain                195            3426

The  classification  accuracy  is  portrayed  by  a  classification  table.  Table  15.3  shows  the 
table  for  this  example.  We  can  summarize  such  a  table  with  sensitivity  and  specificity 
measures  (Sections  2.1.3  and  6.3.3).  The  tree  correctly  predicts  the  proportion  130/(130  + 
195)  =  0.40  of  those  who  actually  disenrolled  and  3426/(903+3426)  =  0.79  of  those  who 
remained.  An  ROC  curve  can  show  how  these  rates  vary  as  we  vary  the  misclassification 
costs,  thus  affecting  the  predictions.  The  area  under  the  ROC  curve  can  be  compared  for 
various  classification  methods  as  a  way  of  comparing  their  success  rates  (Hastie  et  al. 
2009,  Sec.  9.2).  However,  Hand  (2009)  argued  that  this  measure  is  incoherent,  because  of 
different  misclassification  costs  for  different  classification  rules.  He  proposed  an  alternative 
approach  based  on  averaged  misclassification  cost  rather  than  averaged  sensitivity. 
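
The summaries of Table 15.3 can be reproduced with a few lines of Python. The function and variable names are hypothetical; the counts are those of Table 15.3 and the 13-to-1 cost ratio is the one used by Noe et al. (2009).

```python
def sensitivity_specificity(true_pos, false_pos, false_neg, true_neg):
    """Proportion of actual positives and of actual negatives predicted correctly."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Table 15.3, treating "disenroll" as the positive outcome.
sens, spec = sensitivity_specificity(true_pos=130, false_pos=903,
                                     false_neg=195, true_neg=3426)

# Number of misclassifications: the tree versus the naive rule that
# predicts that no one disenrolls.
tree_errors = 903 + 195
naive_errors = 130 + 195

# Weighted misclassification cost with costs 1 and 13 for the two error types.
tree_cost = 1 * 903 + 13 * 195
naive_cost = 13 * (130 + 195)

print(round(sens, 2), round(spec, 2))  # 0.4 0.79
```

The naive rule makes fewer errors (325 versus 1098), but the tree has the lower weighted cost (3438 versus 4225), which is why the unequal costs lead the tree away from the naive rule.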

For many data sets, some subjects are missing observations on at least one predictor
variable. Various approaches can be used so those subjects enter the analysis. When considering
a predictor for a particular split, a simple approach uses only observations for which
that predictor is not missing. For a categorical predictor, instead adding a new category for
missing can help reveal when missingness is not at random but is associated with a certain
outcome (Hastie et al. 2009, p. 311).


15.2.5  Classification  Trees  Versus  Logistic  Regression 

Classification  trees  provide  a  simple  mechanism  for  using  answers  to  a  set  of  binary 
explanatory  questions  to  predict  a  binary-response  variable.  A  person  can  view  the  tree 
and  clearly  see  which  subjects  have  y  =  1.  Compared  with  logistic  regression  and  other 
binary  classification  methods,  tree-structured  classification  has  the  advantage  of  being  easily 
understandable  and  useable  by  practitioners  who  have  little  understanding  of  basic  statistics. 
Also,  the  trees  do  not  require  assumptions  about  the  functional  relationship  between  the 
response  variable  and  the  predictor  variables.  In  particular,  it  is  easier  to  detect  potentially 
important  interaction  structure  among  the  predictors,  it  is  not  necessary  to  prespecify 
categories  for  continuous  predictors  such  as  age,  and  the  trees  are  invariant  to  monotone 
transformations  of  such  predictors.  The  trees  can  more  easily  accommodate  missing  data 
on  some  predictors,  and  they  rely  on  well-defined  variable  selection  procedures,  which  is  a 
thorny  issue  for  logistic  regression  with  a  large  number  of  explanatory  variables. 

A  disadvantage  of  a  classification  tree  compared  with  logistic  regression  modeling  is 
the  lack  of  smoothness  caused  by  each  terminal  node  of  subjects  being  treated  in  the  same 
way,  since  the  region  of  explanatory  variable  values  having  y  =  1  is  a  set  of  rectangular 
regions.  In  the  above  example,  for  instance,  all  subjects  of  age  <70  are  predicted  to  remain 
in the program, regardless of their values on other explanatory variables. If there truly is a
simple  linear  structure  for  how  the  explanatory  variables  affect  the  response,  as  in  a  logistic 
regression  model  having  only  main  effects,  the  tree  will  not  help  us  discover  this  structure. 
Logistic  regression  has  the  advantage  over  classification  trees  and  discriminant  analysis  of 
providing  direct  ways  of  summarizing  effects  of  explanatory  variables,  through  odds  ratios. 
Moreover,  those  effects  are  all  conditional  on  the  other  explanatory  variables,  whereas  with 
the  classification  tree  the  displayed  effects  are  mixed;  the  first  split  refers  to  a  marginal 
effect,  the  second  to  a  conditional  effect  given  the  first  split,  and  so  forth.  Also,  rather  than 
relying  on  an  automatic  algorithm  for  forming  a  tree,  in  many  applications  it  is  better  to 
use  existing  theory  to  suggest  variables  to  use  at  particular  levels  of  the  hierarchy. 

Finally,  the  classification  tree  method  has  low  bias  but  high  variance.  There  can  be  high 
variability  in  classification  trees  produced  by  different  random  samples  from  a  common 
population, partly because of its hierarchical nature. Two samples that have different initial
splits may end up with very different trees because of the influence of the initial split on
the  way  the  tree  evolves.  Or,  an  optimal  split  early  in  the  tree  construction  may  cause  the 
tree-constructing  algorithm  to  miss  another  useful  classifier. 

Because  of  this  variability  and  the  segmentation  into  possibly  very  small  groups  that 
occurs  with  multiple  splits,  the  classification  tree  method  can  require  rather  large  sample 
sizes  to  work  effectively.  Even  then,  when  the  number  of  predictors  is  large,  simpler  methods 
such  as  nearest  neighbor  methods  and  linear  discriminant  methods  that  treat  the  explanatory 
variables  as  uncorrelated  may  have  better  classification  performance.  For  example,  see 
Dudoit  et  al.  (2002),  who  compared  various  methods  for  classifying  tumors  using  gene 
expression  data.  To  reduce  the  high  variability  effect,  L.  Breiman  proposed  generalizations 
of  classification  trees.  Bagging  (a  term  that  stands  for  “bootstrap  aggregation”)  is  a  method 
of  averaging  many  trees,  each  constructed  from  an  alternative  sample  that  is  generated 
from  the  original  one  using  the  bootstrap  (Hastie  et  al.  2009,  Sec.  8.7).  Random  forest 
ensembles  of  tree-classifiers  also  average  trees,  but  select  at  each  node  a  small  group  of 
input  variables  on  which  to  consider  splits,  the  goal  being  to  reduce  the  correlation  between 
trees  (Hastie  et  al.  2009,  Chap.  15).  A  disadvantage  is  that  the  overall  contribution  of  a 
particular  predictor  is  less  clear  than  in  ordinary  classification  trees  or  in  logistic  regression. 
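
The idea of bagging can be sketched in Python with a deliberately simplified stand-in for a tree. The sketch below is not Breiman's implementation: each "tree" is only a one-split stump on a single quantitative predictor, the aggregate is a majority vote, and all names, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Choose the cutpoint c, and labels for x <= c and x > c, that give the
    fewest misclassifications in the sample (equal costs assumed)."""
    best = None
    for c in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, c, left_label, right_label)
    return best[1:]

def bagged_predict(xs, ys, x_new, n_trees=25, seed=0):
    """Fit a stump to each bootstrap sample and return the majority prediction."""
    rng = random.Random(seed)
    n = len(xs)
    votes = Counter()
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n cases with replacement
        c, left_label, right_label = fit_stump([xs[i] for i in idx],
                                               [ys[i] for i in idx])
        votes[left_label if x_new <= c else right_label] += 1
    return votes.most_common(1)[0][0]

# Hypothetical ages and binary outcomes.
ages = [55, 60, 65, 68, 72, 75, 80, 85, 88, 90]
outcomes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
young_prediction = bagged_predict(ages, outcomes, 50)
old_prediction = bagged_predict(ages, outcomes, 95)
```

Because each bootstrap sample can place its cutpoint differently, the vote smooths over the instability of any single fitted tree.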

Because  of  the  lack  of  smoothness,  high  variability,  and  atheoretic  nature  of  classification 
trees,  many  researchers  use  this  method  mainly  in  an  exploratory  manner.  Results  of  a 
classification  tree  analysis,  combined  with  existing  theory,  can  suggest  logistic  models  to 
use  in  future  research. 

15.2.6  Support  Vector  Machines  for  Classification 

In  summary,  Sections  15.1  and  15.2  suggest  that  (1)  if  it  seems  reasonable  to  assume 
normally  distributed  X  with  common  covariance,  simple  linear  discriminant  analysis  is 
appropriate  for  classification;  (2)  if  X  may  be  far  from  normal  but  logistic  regression 
seems  reasonable,  we  can  use  it  for  classification;  (3)  if  X  may  interact  in  unknown  ways  to 
determine y but simple rectangular regions are desired for classification, then tree-structured
methods  are  sensible. 

Finally, a more complex method, support vector machines, has a decision boundary
that can be highly irregular. For logistic regression, we’ve seen (Section 6.5) that perfect
prediction and at least one infinite ML parameter estimate result when a hyperplane can
separate the set of x values for which y = 1 from the set of x values for which y = 0. This
hyperplane is a linear decision boundary. For classification purposes with future predictions,
the optimal hyperplane has the maximum margin of separation between it and the nearest
data  points  in  the  two  sets  of  observations.  In  practice,  we  usually  expect  the  sets  to  overlap, 
and  such  a  linear  decision  boundary  giving  perfect  predictions  is  not  available.  A  support 
vector  machine  attempts  to  improve  predictions  over  such  hyperplane  decision  boundaries 
by  producing  nonlinear  boundaries. 

With  support  vector  machines,  the  set  of  x  values  for  which  y  =  1  is  more  complex  than 
with  linear  discriminant  analysis  or  with  tree-structured  classification.  The  set  is  determined 
by  producing  a  linear  boundary  in  a  transformation  of  the  space  of  explanatory  variables, 
essentially  replacing  the  predictors  in  a  linear  discriminant  by  some  (possibly  much  larger) 
set of functions of them. The boundary depends on only a subset of the x_i values, which
are  the  support  vectors  and  fall  on  the  margin  hyperplanes.  A  kernel  smoothing  parameter 
controls  the  degree  of  nonlinearity,  and  a  separate  smoothing  parameter  controls  the  desired 
size  of  margin  between  the  decision  boundary  and  the  nearest  data  points. 

Hastie et al. (2009, p. 111) stated that ordinary linear discriminant analysis often performs
well  compared  with  more  exotic  methods.  Even  though  linear  discriminant  analysis  may 
have  higher  bias,  it  has  the  benefit  of  low  variance  because  of  its  simplicity.  Similarly,  Hand 
(2006)  argued  that  “simple  methods  typically  yield  performance  almost  as  good  as  more 
sophisticated  methods,  to  the  extent  that  the  difference  in  performance  may  be  swamped 
by  other  sources  of  uncertainty.”  He  noted,  for  example,  that  in  practice  the  data  points 
available  for  determining  the  classification  rule  are  not  randomly  drawn  from  the  same 
distribution  to  which  the  classifier  will  be  applied,  so  statements  about  classifier  accuracy 
need  to  take  this  into  account.  Also,  interpretability  is  often  an  important  requirement  of  a 
classification  rule,  and  this  favors  simple  methods. 


15.3  CLUSTER  ANALYSIS  FOR  CATEGORICAL  DATA 

The  methods  presented  so  far  in  this  chapter  have  distinguished  between  response  and 
explanatory  variables.  For  example,  discriminant  analysis  and  classification  trees  are  like 
logistic  regression  in  using  values  on  explanatory  variables  to  classify  observations  into  two 
well-defined  groups  that  are  the  categories  of  the  response  variable.  In  some  applications, 
such  groups  are  not  identified,  but  it  is  still  relevant  to  sort  observations  into  clusters  of 
observations. 

For  example,  in  “market  basket  data”  applications,  a  person’s  observation  is  a  vector  of 
binary  indicators  in  which  a  particular  component  indicates  whether  the  person  purchased 
the  corresponding  item.  A  consumer  research  study  might  want  to  identify  groups  of 
customers  with  similar  buying  behavior.  Likewise,  a  company  such  as  Google  that  provides 
Internet  searching  capability  might  seek  to  identify  groups  of  people  who  have  similar 
browsing  behavior.  A  company  such  as  Amazon  recommends  products  to  people  based  on 
a  cluster  affinity  analysis  that  takes  into  account  their  purchase  history  and  the  history  of 
other  people  who  have  bought  the  same  items.  A  financial  institution  might  try  to  detect 
outliers  in  purchases,  such  as  in  credit  card  fraud  detection.  In  biology,  clustering  methods 
can  organize  plants  or  animals  into  groups  according  to  observed  features.  With  gene 
microarray  data,  clustering  methods  can  identify  clusters  of  genes  with  similar  patterns  of 
expression  and  may  help  to  identify  genes  responsible  for  certain  diseases  (Hastie  et  al. 
2009,  Chap.  7;  Dudoit  et  al.  2003).  Clustering  methods  have  even  been  used  to  group 
different  brands  of  Scotch  whisky  (Lapointe  and  Legendre  1994). 

15.3.1 Supervised Versus Unsupervised Learning

Discriminant analysis is sometimes referred to as supervised learning, because known classifications
for some observations can be used (as if provided by a “supervisor”) to develop
a discriminant function that can classify other observations measured on the same explanatory
variables. By contrast, clustering methods are examples of unsupervised learning: The
classification categories (the clusters) are unknown but features are observed that relate to
the unobserved categories.

In  this  section  we’ll  consider  data  sets  consisting  of  n  observations  on  a  vector  of  p 
binary  variables,  the  goal  being  to  group  those  observations  into  a  set  of  k  clusters.  Those 
clusters  can  be  regarded  as  categories  of  an  unknown  variable.  The  number  k  itself  may  be 
unknown. 

We can summarize the data as a 2^p contingency table that cross-classifies the n observations
on the p binary variables. In some applications, such as market basket data, p may
be very large. The table is then extremely sparse. It is more useful to express analyses
directly in terms of the n × p data file of indicator variables, where row i shows the p
binary responses (y_i1, ..., y_ip) for observation i.


15.3.2  Measuring  Dissimilarity  Between  Observations 

Clustering  methods  use  a  measure  of  dissimilarity  between  observations.  The  clusters  group 
together  similar  observations.  Ideally,  observations  within  a  cluster  have  low  dissimilarity 
whereas  observations  in  different  clusters  have  high  dissimilarity.  A  clustering  method  is 
characterized  by  its  dissimilarity  measure  and  the  algorithm  for  implementing  the  clustering. 

For  vectors  of  observations  on  p  binary  variables,  Table  15.4  summarizes  the  similarity 
and  dissimilarity  for  observations  h  and  i.  There  are  a  variables  j  in  the  vector  for  which 
y_hj = y_ij = 1, and d for which y_hj = y_ij = 0. A simple similarity measure for a pair of
observations is the proportion of the p = (a + b + c + d) variables that have a match,
which  is  (a  +  d)/(a  +  b  +  c  +  d).  The  corresponding  dissimilarity  measure  subtracts  the 
similarity  measure  from  1,  giving  the  proportion  (b  +  c)/(a  +  b  +  c  +  d)  for  which  the 
outcome  differs. 

In  some  applications,  a  common  response  of  1  is  more  relevant  than  a  common  response 
of  0.  With  market  basket  data,  for  example,  each  person’s  observation  consists  of  a  very 
high  proportion  of  0  entries  (i.e.,  items  not  bought),  so  there  is  necessarily  a  high  proportion 
of  variables  with  a  common  outcome.  Then,  an  asymmetric  similarity  measure  may  be  more 
relevant. A popular similarity measure of this type is a/(a + b + c), the number of variables
coded as 1 for both observations divided by the number of variables that are coded as 1 for
either or both observations. This is a special case of the Jaccard index, which for two sets
is defined as the size of the intersection divided by the size of the union. The corresponding
dissimilarity index is (b + c)/(a + b + c).

Table 15.4  Cross Classification of Two Observations on p Binary Variables, where p = (a + b + c + d)

                  Observation i
Observation h     1       0
1                 a       b
0                 c       d
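
Both dissimilarity measures follow directly from the counts in Table 15.4, as the following Python sketch shows. The function names and the two example vectors are hypothetical, and returning 0 for a pair of all-zero vectors in the Jaccard case is an assumption, since the index is undefined there.

```python
def match_counts(y_h, y_i):
    """Return (a, b, c, d) as laid out in Table 15.4 for two binary vectors."""
    a = sum(1 for u, v in zip(y_h, y_i) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(y_h, y_i) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(y_h, y_i) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(y_h, y_i) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching_dissimilarity(y_h, y_i):
    """Proportion of the p variables on which the two observations differ."""
    a, b, c, d = match_counts(y_h, y_i)
    return (b + c) / (a + b + c + d)

def jaccard_dissimilarity(y_h, y_i):
    """Like simple matching, but common 0 responses (d) are ignored."""
    a, b, c, _ = match_counts(y_h, y_i)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (b + c) / (a + b + c)

# Two hypothetical market baskets over p = 8 items.
basket_h = [1, 1, 0, 0, 0, 0, 0, 0]
basket_i = [1, 0, 1, 0, 0, 0, 0, 0]
print(match_counts(basket_h, basket_i))                   # (1, 1, 1, 5)
print(simple_matching_dissimilarity(basket_h, basket_i))  # 0.25
print(jaccard_dissimilarity(basket_h, basket_i))
```

The two baskets share only one of the three items bought by either person, so the Jaccard dissimilarity of 2/3 is much larger than the simple matching value of 0.25, which is dominated by the five items that neither person bought.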

Likewise,  measures  of  similarity  and  dissimilarity  can  be  defined  for  pairs  of  clusters  of 
observations.  The  average  linkage  measures  the  average  of  the  dissimilarities  between  all 
the  pairs  of  observations,  one  from  each  cluster.  Such  measures  do  not  account,  however, 
for  associations  among  the  variables  and  treat  them  all  identically.  Alternatively,  if  the 
observations result from a probability sample, such as multinomial over the 2^p cells of
the  table  cross-classifying  the  p  variables,  we  could  use  a  log-likelihood-based  measure 
of  dissimilarity.  For  example,  the  distance  between  two  clusters  could  be  defined  as  the 
decrease  in  the  maximized  log  likelihood  when  we  compare  a  model  with  separate  parameter 
values  for  each  cluster  to  a  model  with  a  common  parameter  value  for  the  two  clusters. 

15.3.3  Clustering  Algorithms:  Partitions  and  Hierarchies 

For a particular dissimilarity measure, two types of algorithms are commonly used to perform
the clustering. One type partitions the observations in various ways and evaluates each
partition according to some criterion. For a working partition at some stage for a k-cluster
solution, the medoid of a cluster is the observation with smallest total dissimilarity to the
other points in the cluster. The goal of k-medoid clustering is to seek an optimal partition
in terms of minimizing the sum over the clusters of the total within-cluster dissimilarities
between the observations and their medoids. For an initial partitioning, one algorithm assigns
each observation to the cluster whose medoid has the smallest dissimilarity with it,
then recomputes the medoids, and iterates. Kaufman and Rousseeuw (1990) proposed
an alternative strategy that successively moves each medoid to an observation that is
not currently one, then makes the exchange that provides the greatest reduction in the sum of
the total within-cluster dissimilarities, continuing until no exchanges are found that provide
an improvement.
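
The first of these strategies (assign, recompute the medoids, iterate) can be sketched in Python as follows. This is a minimal illustration rather than the Kaufman and Rousseeuw exchange algorithm; the function names, the toy data, and the use of the number of differing variables as the dissimilarity are assumptions, and the initial medoids are assumed to be distinct observations.

```python
def hamming(u, v):
    """Number of binary variables on which two observations differ."""
    return sum(1 for s, t in zip(u, v) if s != t)

def assign(points, medoids, dissimilarity):
    """Map each medoid index to the indices of the observations nearest to it."""
    clusters = {m: [] for m in medoids}
    for idx, p in enumerate(points):
        nearest = min(medoids, key=lambda m: dissimilarity(p, points[m]))
        clusters[nearest].append(idx)
    return clusters

def k_medoids(points, initial_medoids, dissimilarity=hamming, max_iter=100):
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids, dissimilarity)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid whose cluster happens to be empty
                new_medoids.append(m)
                continue
            # New medoid: member with smallest total dissimilarity to the others.
            new_medoids.append(min(
                members,
                key=lambda c: sum(dissimilarity(points[c], points[o]) for o in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dissimilarity)

def total_cost(points, clusters, dissimilarity=hamming):
    """Sum over clusters of dissimilarities between observations and their medoid."""
    return sum(dissimilarity(points[i], points[m])
               for m, members in clusters.items() for i in members)

# Six hypothetical observations on p = 6 binary variables.
points = [(1, 1, 1, 0, 0, 0), (1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0),
          (0, 0, 0, 1, 1, 1), (0, 0, 0, 0, 1, 1), (0, 0, 1, 1, 1, 1)]
medoids, clusters = k_medoids(points, initial_medoids=[1, 4])
```

Starting instead from medoids 1 and 2, which both lie in the first group, this simple iteration stops at a partition with a larger total cost, so trying several initial partitions is worthwhile.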

The other main type of algorithm creates a hierarchical decomposition of the observations
according to some criterion. The clusters at a particular level of the hierarchy result
from merging clusters at the next level. At one extreme there is a single cluster of all observations
and at the other extreme each observation forms its own cluster. The entire hierarchy
portrays an ordered sequence of clusters. We can either create clusters by starting with each
observation as its own cluster and merging them (agglomerative clustering) or instead start
with all observations in a single cluster and at each stage divide an existing cluster into
two clusters (divisive clustering). With agglomerative clustering, a step of the algorithm
combines into a single cluster the pair of clusters having the smallest dissimilarity. With a
hierarchical clustering method, a tree called a dendrogram displays the process of merging
or dividing clusters. It portrays the grouping as a function of a metric such as the average
dissimilarity between clusters being merged, and hence shows the clusters at each stage.
The example in Section 15.3.4 illustrates.
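
A bare-bones agglomerative procedure with average linkage can be sketched in Python as follows. The names and the four toy observations are hypothetical, the dissimilarity is taken to be the number of differing variables, and the sketch merges one pair of clusters per step, so tied pairs are merged in successive steps rather than simultaneously.

```python
def hamming(u, v):
    """Number of binary variables on which two observations differ."""
    return sum(1 for s, t in zip(u, v) if s != t)

def average_linkage(cluster_a, cluster_b, points):
    """Average dissimilarity over all pairs with one observation from each cluster."""
    total = sum(hamming(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points):
    """Return the merge history as (height, cluster, cluster) triples."""
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, sorted(clusters[i]), sorted(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return merges

# Four hypothetical observations on p = 4 binary variables.
toy_points = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
merge_history = agglomerate(toy_points)
```

The heights recorded in the merge history are the values that a dendrogram would plot on its vertical scale.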

Each type of clustering algorithm has advantages and disadvantages. For the agglomerative
hierarchical  approach,  two  samples  from  a  population  that  combine  clusters  differently 
at  an  early  stage  may  have  quite  different  looking  dendrograms  at  a  later  stage.  For  the 
partitioning  approach,  it  is  usually  computationally  impractical  to  consider  all  possible 
partitions.  It  is  necessary  to  either  implement  some  stochastic  element  to  the  process  or 
weaken  the  criterion  used.  Ultimate  results  may  depend  on  the  initial  partition  used,  and 
it  is  sensible  to  try  a  few  different  ones  (e.g.,  perhaps  including  the  solution  obtained  with 
a  hierarchical  method)  to  increase  the  chance  of  finding  the  globally  optimal  solution. 
According to Hastie et al. (2009, p. 506), “specifying an appropriate dissimilarity measure
is  far  more  important  in  obtaining  success  with  clustering  than  the  choice  of  clustering 
algorithm.”  For  any  algorithm,  however,  the  clusters  found  may  not  reflect  a  true  categorical 
classification  but  may  merely  be  an  artifact  of  that  algorithm. 

The  number  of  clusters  k  is  often  unknown.  Any  algorithm,  whether  of  a  partitioning  or 
a  hierarchical  nature,  requires  some  termination  condition  for  determining  k.  For  example, 
agglomerative  hierarchical  clustering  could  keep  combining  clusters  as  long  as  the  average 
dissimilarity  between  a  pair  of  clusters  to  be  combined  is  less  than  some  particular  fixed 
value.  An  informal  way  to  choose  k  plots  the  value  of  the  clustering  criterion  against 
k,  looking  for  a  natural  break  point  where  this  changes  substantially.  Or,  we  could  plot 
(against  k)  a  summary  such  as  the  probability  that  the  dissimilarity  for  a  randomly  selected 
within-cluster  pair  of  observations  is  smaller  than  the  dissimilarity  for  a  randomly  selected 
between-cluster  pair.  An  adaptation  of  the  Goodman  and  Kruskal  gamma  measure  takes 
the  difference  between  this  concordance  probability  and  a  discordance  probability,  divided 
by  their  sum  (Baker  and  Hubert  1975). 
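
The gamma-type summary can be computed by comparing every within-cluster pair with every between-cluster pair. The following Python sketch is one reading of that description: the function names and toy data are hypothetical, tied comparisons are simply dropped, and the brute-force double loop is only practical for small n.

```python
from itertools import combinations

def hamming(u, v):
    """Number of binary variables on which two observations differ."""
    return sum(1 for s, t in zip(u, v) if s != t)

def clustering_gamma(points, labels, dissimilarity=hamming):
    """(concordant - discordant) / (concordant + discordant), where a comparison
    is concordant if the within-cluster pair is less dissimilar than the
    between-cluster pair and discordant if it is more dissimilar."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dissimilarity(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    concordant = sum(1 for w in within for b in between if w < b)
    discordant = sum(1 for w in within for b in between if w > b)
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every comparison is tied")
    return (concordant - discordant) / (concordant + discordant)

# Four hypothetical observations and two candidate clusterings.
toy_points = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
gamma_good = clustering_gamma(toy_points, [0, 0, 1, 1])
gamma_poor = clustering_gamma(toy_points, [0, 1, 0, 1])
```

Plotting such a value against k for the clusterings produced at each k gives one of the informal displays described above.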

Partitioning  and  hierarchical  clustering  methods  need  not  assume  a  probability  model  for 
the  data.  Some  clustering  methods,  though,  are  probabilistic  model-based.  Such  approaches 
usually assume that the observations come from a k-component mixture distribution of some
type.  An  example  is  the  latent  class  model  of  Section  14.1.  With  it,  each  observation  is 
not  actually  assigned  a  cluster  but  rather  a  probability  distribution  over  the  clusters  (latent 
classes)  to  which  it  could  belong  (Fraley  and  Raftery  2002,  Magidson  and  Vermunt  2004). 
In  practice,  observations  are  typically  assigned  to  the  cluster  for  which  the  probability  value 
is highest. This model-based approach also applies directly to multicategory response data.
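
For a latent class model in which the binary variables are independent within each class, the probability distribution over clusters for one observation follows from Bayes' theorem. The Python sketch below assumes the class proportions and the within-class probabilities have already been estimated; the function name and all parameter values are hypothetical.

```python
def posterior_class_probabilities(y, class_weights, item_probs):
    """Posterior probability of each latent class given binary responses y.

    class_weights[z]: probability of class z.
    item_probs[z][j]: probability that variable j equals 1 within class z.
    """
    joint = []
    for weight, probs in zip(class_weights, item_probs):
        likelihood = weight
        for y_j, p_j in zip(y, probs):
            likelihood *= p_j if y_j == 1 else (1.0 - p_j)
        joint.append(likelihood)
    total = sum(joint)
    return [value / total for value in joint]

# Hypothetical two-class model for p = 3 binary variables.
weights = [0.5, 0.5]
probs = [[0.9, 0.9, 0.1],   # class 0
         [0.2, 0.2, 0.8]]   # class 1
posterior = posterior_class_probabilities([1, 1, 0], weights, probs)

# Assign the observation to the class with the highest posterior probability.
assigned_class = max(range(len(posterior)), key=lambda z: posterior[z])
```

Here the response pattern (1, 1, 0) is far more likely under class 0, so the observation would be assigned to that cluster.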

With  very  high-dimensional  data,  such  as  market  basket  data  or  DNA  microarray  data, 
the  challenges  to  clustering  are  many,  regardless  of  the  type  of  data.  There  may  be  many 
irrelevant  variables  that  have  the  impact  of  masking  clusters,  as  clusters  might  exist  only 
for  a  very  small  subset  of  the  variables.  Dissimilarity  measures  then  are  less  meaningful,  as 
observations  may  be  close  for  the  most  relevant  variables  but  the  curse  of  dimensionality 
may  put  them  far  apart  in  high-dimensional  space.  To  attempt  to  account  for  this,  clustering 
can  be  attempted  in  various  subspaces  by  clustering  observations  on  subsets  of  variables 
rather  than  all  of  them  simultaneously  (Brusco  2004,  Friedman  and  Meulman  2004). 

15.3.4  Example:  Clustering  States  on  Election  Results 

The text website has a 51 × 8 matrix, with a row for each U.S. state and D.C., that shows
the  party  (Democrat  or  Republican)  that  won  the  electoral  votes  for  that  state  for  each 
presidential  election  between  1980  and  2008.  Table  15.5  shows  an  excerpt  from  that  table. 

We  measure  dissimilarity  using  the  number  of  elections  on  which  the  states  differ.  States 
with  identical  vectors  of  responses,  such  as  Massachusetts  and  New  York,  have  dissimilarity 
values  of  0.  By  contrast,  Minnesota  and  Texas  differ  in  the  outcome  for  every  election,  so 
the  dissimilarity  is  8.  Minnesota  and  Virginia  agree  in  only  1  of  the  8  elections,  so  their 
dissimilarity  is  7. 
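
These dissimilarities can be verified with a short Python sketch. The dictionary below codes a few of the states as strings of D and R for the eight elections from 1980 to 2008; the variable and function names are hypothetical, and Minnesota is coded as Democrat in every election, consistent with the dissimilarity of 8 with Texas stated above.

```python
# One character per election: 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008.
votes = {
    "Massachusetts": "RRDDDDDD",
    "New York":      "RRDDDDDD",
    "California":    "RRRDDDDD",
    "Minnesota":     "DDDDDDDD",
    "Texas":         "RRRRRRRR",
    "Virginia":      "RRRRRRRD",
}

def election_dissimilarity(state_a, state_b):
    """Number of elections in which the two states chose different parties."""
    return sum(1 for x, y in zip(votes[state_a], votes[state_b]) if x != y)

print(election_dissimilarity("Massachusetts", "New York"))    # 0
print(election_dissimilarity("Minnesota", "Texas"))           # 8
print(election_dissimilarity("Minnesota", "Virginia"))        # 7
print(election_dissimilarity("Massachusetts", "California"))  # 1
```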

The  agglomerative  hierarchical  algorithm  starts  with  51  clusters,  one  for  each  state  and 
D.C.  At  the  first  step,  states  are  combined  that  have  the  minimum  dissimilarity,  which  in 
this  case  consists  of  states  such  as  Massachusetts  and  New  York  that  have  dissimilarity  of  0. 
At  that  step,  eight  clusters  of  states  have  dissimilarity  of  0  for  all  pairs  within  each  cluster. 
At  the  next  step,  clusters  are  combined  with  the  next  smallest  dissimilarity,  such  as  those 
two  states  with  California,  for  which  the  dissimilarity  is  1.  By  the  stage  at  which  there  are 

Table 15.5  Statewide Data on Party (Dem = Democrat, Rep = Republican) Winning
Electoral Votes in Presidential Elections between 1980 and 2008

State           1980  1984  1988  1992  1996  2000  2004  2008
Arizona         Rep   Rep   Rep   Rep   Dem   Rep   Rep   Rep
California      Rep   Rep   Rep   Dem   Dem   Dem   Dem   Dem
Colorado        Rep   Rep   Rep   Dem   Rep   Rep   Rep   Dem
Florida         Rep   Rep   Rep   Rep   Dem   Rep   Rep   Dem
Illinois        Rep   Rep   Rep   Dem   Dem   Dem   Dem   Dem
Massachusetts   Rep   Rep   Dem   Dem   Dem   Dem   Dem   Dem
Minnesota       Dem   Dem   Dem   Dem   Dem   Dem   Dem   Dem
Missouri        Rep   Rep   Rep   Dem   Dem   Rep   Rep   Rep
New Mexico      Rep   Rep   Rep   Dem   Dem   Dem   Rep   Dem
New York        Rep   Rep   Dem   Dem   Dem   Dem   Dem   Dem
Ohio            Rep   Rep   Rep   Dem   Dem   Rep   Rep   Dem
Texas           Rep   Rep   Rep   Rep   Rep   Rep   Rep   Rep
Virginia        Rep   Rep   Rep   Rep   Rep   Rep   Rep   Dem
Wyoming         Rep   Rep   Rep   Rep   Rep   Rep   Rep   Rep

Source: Complete data at www.stat.ufl.edu/~aa/cda/cda.html.


only  two  clusters,  one  cluster  has  21  states  and  D.C.  that  tend  to  vote  Democrat,  and  the 
other  cluster  has  29  states  that  tend  to  vote  Republican. 

To more easily portray the agglomerative cluster-forming process as well as a dendrogram
for displaying results, we redo the analysis using only the data shown in Table 15.5
for  14  states.  Figure  15.3  shows  the  dendrogram.  The  bottom  nodes  of  the  figure  show  the 
initial  14  clusters.  The  vertical  scale  shows  the  average  dissimilarity  between  clusters  being 
joined.  At  the  first  stage,  three  pairs  of  clusters  are  joined,  of  two  states  each,  that  have 


Figure 15.3  Dendrogram (produced using dist and hclust functions in R) for cluster analysis of 14 states
according to presidential election results, for data in Table 15.5.

dissimilarities of 0. At the next stage, the (California, Illinois) cluster is joined with the
(Massachusetts,  New  York)  cluster,  the  average  dissimilarity  between  those  two  clusters 
being  1.  At  the  same  stage,  Virginia  is  joined  with  the  (Texas,  Wyoming)  cluster,  Colorado 
and  Ohio  are  joined  in  a  cluster,  and  Arizona  and  Florida  are  joined  in  a  cluster.  At  this 
stage  there  are  seven  clusters,  these  four  plus  the  single-state  clusters  of  Missouri,  New 
Mexico,  and  Minnesota. 

The  top  of  the  dendrogram  is  the  joining  of  all  states  into  a  single  cluster.  The  two-cluster 
solution,  below  it,  shows  the  Republican-leaning  cluster  (Colorado,  Ohio,  Virginia,  Texas, 
Wyoming,  Missouri,  Arizona,  Florida)  and  the  Democrat-leaning  cluster  (Minnesota,  New 
Mexico,  California,  Illinois,  Massachusetts,  New  York).  At  the  two-cluster  step,  when 
Minnesota  is  joined  with  five  other  states,  the  average  dissimilarity  between  Minnesota  and 
the  other  five  states  is  14/5  =  2.8. 
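
The average dissimilarities quoted in this example can be checked with a short sketch such as the following. It computes only the average-linkage distance between two given clusters, not the full agglomerative algorithm; the coding and function names are illustrative, and Minnesota is again taken as Democrat in all eight elections.

```python
# Election winners 1980-2008 for the states needed here ("D"/"R" coding is
# an illustrative choice; Minnesota is assumed Democrat in every election).
RESULTS = {
    "California":    "RRRDDDDD",
    "Illinois":      "RRRDDDDD",
    "Massachusetts": "RRDDDDDD",
    "Minnesota":     "DDDDDDDD",
    "New Mexico":    "RRRDDDRD",
    "New York":      "RRDDDDDD",
    "Texas":         "RRRRRRRR",
    "Virginia":      "RRRRRRRD",
    "Wyoming":       "RRRRRRRR",
}


def dissimilarity(state_a, state_b):
    """Number of elections in which the two states' winners differ."""
    return sum(a != b for a, b in zip(RESULTS[state_a], RESULTS[state_b]))


def average_dissimilarity(cluster_a, cluster_b):
    """Mean of all pairwise dissimilarities between members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)


others = ["New Mexico", "California", "Illinois", "Massachusetts", "New York"]
print(average_dissimilarity(["Minnesota"], others))
print(average_dissimilarity(["California", "Illinois"],
                            ["Massachusetts", "New York"]))
print(average_dissimilarity(["Virginia"], ["Texas", "Wyoming"]))
```

The first value reproduces 14/5 = 2.8, and the other two reproduce the average dissimilarity of 1 at which those clusters are joined.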


NOTES 

Section  15.1:  Classification:  Linear  Discriminant  Analysis 

15.1  Discriminant  versus  logistic:  For  more  on  discriminant  analysis,  see  Hastie  et  al.  (2009, 
Chap. 4), Lin et al. (2010), McLachlan (2004), and Tutz (2011, Chap. 15). For classification
when  normality  does  not  hold,  Anderson  (1975),  Bull  and  Donner  (1987),  Efron  (1975), 
McLachlan  (2004,  Sec.  8.5),  and  Press  and  Wilson  (1978)  compared  methods,  generally 
rating  logistic  regression  more  favorably  than  discriminant  analysis. 

15.2  Discriminant  generalizations:  The  prediction  rule  in  discriminant  analysis  can  be  amended 
to  take  into  account  different  misclassification  costs  for  the  two  types  of  misclassifications 
or  to  minimize  a  risk  function  based  on  a  penalty  function.  See  Eguchi  and  Copas  (2002). 
It  generalizes  also  to  handle  multiple  classes,  regularized  covariance  matrices  and  lasso- 
type  penalties,  assuming  a  mixture  of  normal  distributions  for  each  category,  penalizing  the 
coefficients  to  make  them  smoother,  and  nonparametric  estimation  of  the  distribution  of 
$(X \mid Y = j)$ or of the form of the regression. See Hastie et al. (2009, Sec. 12.4) and Witten
and Tibshirani (2011). Articles dealing with classification methods for large p include Bickel
and  Levina  (2004),  Fan  and  Fan  (2008),  Friedman  (1989),  Mai  et  al.  (2012  and  references 
therein),  Tibshirani  et  al.  (2003),  and  Wu  et  al.  (2009).  Fan  and  Fan  (2008)  showed  a  way  to 
quantify  the  impact  of  dimensionality  on  classification. 

Section  15.2:  Classification:  Tree-Structured  Prediction 

15.3  Trees/extensions:  For  more  about  tree-structured  classification  and  its  generalizations,  see 
Breiman  et  al.  (1984),  Hastie  et  al.  (2009,  Sec.  9.2),  Loh  (2002),  Loh  and  Shih  (1997), 
Tutz  (2011,  Chap.  11),  and  Zhang  and  Singer  (2010).  See  Zhang  (1998)  for  extensions  to 
multiple  binary  responses,  Piccarreta  (2008)  for  ordinal  responses,  and  Meulman  (2003)  for 
an  interesting  overview.  Hastie  et  al.  (2009)  is  a  good  but  technical  reference  for  various 
“machine  learning”  methods.  See  Chapter  12  for  support  vector  machines.  See  Azzalini  and 
Scarpa  (2012)  for  a  less  technical  introduction  to  data  mining  methods.  Blanchard  et  al. 
(2008)  studied  the  support  vector  machines  algorithm  from  a  statistical  perspective.  Li  (2010) 
proposed  a  boosting  algorithm  for  multiclass  tree-based  classification. 

Section  15.3:  Cluster  Analysis  for  Categorical  Data 

15.4  Clustering  extensions:  For  examples  and  details  about  clustering  algorithms,  although  mainly 
for  continuous  variables,  see  Azzalini  and  Scarpa  (2012),  Everitt  et  al.  (2011),  Fraley  and 
Raftery  (2002),  Hastie  et  al.  (2009,  Sec.  14.3),  and  Kaufman  and  Rousseeuw  (1990).  Booth 
et  al.  (2008)  proposed  a  multilevel  linear  mixed  model,  having  the  feature  that  observations 
from the same cluster are correlated because they share cluster-specific random effects. One
of  the  parameters  in  the  model  is  the  true  underlying  partition  of  the  data,  and  the  posterior 
distribution  of  this  parameter  is  used  to  cluster  the  data.  Hitchcock  and  Chen  (2008)  showed 
advantages  to  smoothing  the  dissimilarities  before  clustering  binary  data,  by  smoothing  the 
proportion  estimates  of  the  agreements  and  disagreements  on  the  p  variables.  Friedman  and 
Meulman  (2004)  and  Hunt  and  Jorgensen  (1999)  considered  clustering  with  mixed  categorical 
and  continuous  variables. 


EXERCISES 

Applications 

15.1  Refer  to  the  classic  use  of  discriminant  analysis  by  Fisher  (1936)  for  Iris  flower 
data  as  discussed  in  the  article  “Iris  flower  data  set”  at  Wikipedia.  Conduct  a  linear 
discriminant  analysis  using  the  data  given  there  for  the  versicolor  and  virginica 
species, with sepal length and petal length as explanatory variables. Use $\pi_0 = 0.50$.
Report  the  linear  discriminant  function  and  show  the  cross-validated  classification 
table. 

15.2  For the classification tree shown in Figure 15.2, explain what the prediction would
be at each terminal node if the cost of misclassifying a person as remaining in the
program when he/she actually disenrolls were (a) 7 times, and (b) equal to, the cost
of misclassifying a person as disenrolling when he/she actually remains.

15.3  Refer  to  the  previous  exercise.  Suppose  you  were  to  conduct  a  binary  regression 
analysis of these data. Summarize the advantages and disadvantages of this
approach compared with the classification tree analysis.

15.4  Figure  15.4  is  a  classification  tree  obtained  for  the  horseshoe  crab  data  with 
explanatory  variables  width  and  quantitative  color  (as  in  Section  15.1.2),  using 
the rpart and prune functions in R, with the complexity parameter set at 0.02
for the pruning. For example, of the 88 crabs with width >25.85 cm and color in
the three lowest categories (1, 2, 3), all 88 were predicted to have satellites; in
fact,  75  had  them  but  13  did  not. 

a.  Summarize  what  the  terminal  nodes  tell  you. 

b.  Construct  the  classification  table.  (This  is  not  strictly  comparable  to  Table  15.1 
for  logistic  modeling  and  discriminant  analysis,  because  that  table  used  cross- 
validation.) 

c.  For  crabs  having  width  <25.85  cm,  explain  how  the  two  terminal  nodes  having 
color  in  one  of  the  two  lightest  categories  give  predictions  (relative  to  each  other) 
that  contradict  what  the  logistic  model  fit  suggests. 

15.5  Use  the  classification  tree  method  with  the  horseshoe  crab  data  (available  at  the  text 
website),  assuming  equal  misclassification  costs  and  all  four  explanatory  variables. 
Specify  the  criteria  you  chose  to  build  and  prune  the  tree.  Explain  how  to  interpret 
the  pruned  tree,  and  explain  how  (if  at  all)  it  conflicts  with  the  results  from  the 
model-building  of  Section  6.1.4. 

Figure 15.4  Pruned classification tree for horseshoe crabs.


15.6  For  the  previous  exercise,  compare  this  method’s  classification  accuracy  for  these 
data  to  that  for  (a)  a  logistic  regression  model  with  the  same  predictors  and  (b)  a 
linear  discriminant  analysis  with  the  same  predictors.  To  keep  results  comparable, 
either  use  cross-validation  in  all  cases  or  do  not  use  it  in  all  cases. 

15.7  Using the spam data set at the website www-stat.stanford.edu/~tibs/ElemStatLearn
for Hastie et al. (2009), use two methods presented in this chapter
to classify whether a given email is spam. Explain how you implemented the
methods, form classification tables, and summarize results.

15.8  For the cluster analysis example in Section 15.3.4, the observations were the same
for New Jersey and Pennsylvania as Illinois, and the same for North Carolina as
Virginia. Conduct and interpret a cluster analysis using these three states together
with the 14 states in Table 15.5.

15.9  For  the  previous  exercise,  conduct  a  cluster  analysis  using  the  full  data  set  at  the 
text  website.  Explain  how  you  implemented  the  method,  and  interpret  results. 

15.10  The  grounds  on  which  a  divorce  of  a  marriage  can  be  sought  vary  from  state  to 
state.  Table  15.6  shows  data  for  eight  states.  The  complete  data  are  at  the  text 
website. 

a.  For  a  cluster  analysis  to  identify  groups  of  similar  states,  does  it  make  sense 
to  use  a  symmetric  dissimilarity  index,  such  as  the  proportion  of  grounds 

Table 15.6  Statewide Data on Grounds for Divorce, Where 1 = Yes and 0 = No

                        Grounds for Divorce
State           1    2    3    4    5    6    7    8    9
California      1    0    0    0    0    0    0    1    0
Florida         1    0    0    0    0    0    0    1    0
Illinois        0    1    1    0    1    1    1    0    0
Massachusetts   1    1    1    1    1    1    1    0    1
Michigan        1    0    0    0    0    0    0    0    0
New York        0    1    1    0    0    1    0    0    1
Texas           1    1    1    0    0    1    0    1    1
Washington      1    0    0    0    0    0    0    0    1

Note: The grounds are (1) incompatibility, (2) mental cruelty, (3) desertion, (4) nonsupport, (5) alcohol abuse,
(6) felony, (7) impotence, (8) insanity, (9) separation.

Source: From p. 1516 of SAS/STAT 9.2 User's Guide: The DISTANCE Procedure, © 2008 SAS Institute Inc.,
Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc.


that  differ,  or  an  asymmetric  measure  such  as  the  Jaccard  dissimilarity  index? 
Explain. 

b.  For the dissimilarity method you chose, show the first two steps of an
agglomerative hierarchical approach for the observations in Table 15.6.

15.11  For  the  previous  exercise,  conduct  a  cluster  analysis  using  the  full  data  set  at  the 
text  website. 

a.  Using  the  Jaccard  dissimilarity  index,  show  that  the  nine-cluster  solution  has 
four  clusters  with  single  states,  a  cluster  of  ten  states  that  have  the  same  responses 
as  Michigan,  a  cluster  of  four  states  that  have  the  same  responses  as  Florida  and 
California,  a  cluster  of  four  states  that  have  the  same  responses  as  Washington, 
a  cluster  of  25  states  including  Massachusetts  and  Texas,  and  a  cluster  of  three 
states  including  New  York. 

b.  Show  the  result  of  a  two-cluster  solution.  Explain  your  choices  for  implementing 
the  method,  and  display  the  dendrogram  and  interpret  results. 

15.12  Project: Go to a site with large data files, such as the UCI Machine Learning
Repository (archive.ics.uci.edu/ml) or Yahoo! Webscope (webscope.sandbox.yahoo.com).
Find a data set of interest to you that has a categorical response
variable.  Use  at  least  one  method  presented  in  this  chapter  to  analyze  the  data. 
Summarize  your  analyses  in  a  two-page  report,  attaching  an  appendix  showing 
your  use  of  software. 

Theory  and  Methods 

15.13  Assuming that $(\mathbf{X} \mid Y = j)$ has a multivariate $N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma})$ distribution, $j = 0, 1$, derive
the logistic expression for $P(Y = 1 \mid \mathbf{x})$ given at the beginning of Section 15.1.1.

15.14  For  a  binary  classification  tree,  explain  why  the  number  of  nodes  T  relates  to  the 
number of terminal nodes $\tilde{T}$ by $T = 2\tilde{T} - 1$.

15.15  For applications of cluster analysis such as to market basket data with extremely
large p and a very high proportion of 0 responses, explain why the dissimilarity
index (b + c)/p (in the notation of Table 15.4) may not be appropriate.

15.16  For  a  2  x  2  table  for  two  binary  variables,  what  clusters  would  be  formed  in  a 
cluster  analysis,  under  the  constraint  that  two  observations  in  the  same  cluster  must 
have  a  dissimilarity  value  no  greater  than  0? 

15.17  Explain  the  similarities  and  differences  between  cluster  analysis  and  latent  class 
analysis,  in  terms  of  sampling  assumptions  for  the  methods  and  in  terms  of  the  sorts 
of  conclusions  that  are  reached.  Illustrate  using  the  full  data  at  the  text  website  for 
the  election  results  example  in  Section  15.3.4. 

15.18  Do a literature search and write a two-page paper describing cluster analysis
methods that are available when the observed features are multicategory rather than
binary.  For  any  method  described,  explain  whether  it  treats  variables  as  nominal  or 
ordinal. 

15.19  Based  on  reading  appropriate  literature,  prepare  a  two-page  report  summarizing  the 
(a) bagging or (b) random forest approach to classification trees. In the final
paragraph of your report specify the method's advantages and disadvantages compared
with  other  methods. 


CHAPTER  16 


Large- and Small-Sample Theory for
Multinomial Models


This  chapter  gives  a  unified  presentation  of  the  large-sample  theory  and  small-sample 
theory  that  we’ve  used  in  this  book  for  parametric  models  for  categorical  data.  The  primary 
emphasis  is  on  multinomial  models  for  contingency  tables. 

In Section 16.1 we review and extend the delta method for deriving large-sample normal
distributions for many statistics. In Section 16.2 we apply the delta method to estimators
of parameters in models for contingency tables, later illustrated in Section 16.4 for
logistic and loglinear models. In Section 16.3 we derive large-sample distributions of cell
residuals and the $X^2$ and $G^2$ goodness-of-fit statistics. We'll see that powerful results
can follow from simple mathematical ideas, such as Taylor series expansions. In Sections
16.5 and 16.6 we present the theory for small-sample tests and confidence intervals for
proportions and parameters for contingency tables. The emphasis throughout is on ML
inference, but the final section mentions alternative approaches that have similar large-sample
properties.

The results in this chapter have a long history. Pearson (1900) derived the limiting
chi-squared distribution of $X^2$ for testing a specified multinomial distribution. Fisher (1922,
1924) showed the degrees of freedom adjustment when multinomial probabilities are
functions of unknown parameters. Cramér (1946, pp. 424-434) formally proved this result,
under the assumption that ML estimators of the parameters are consistent. Rao (1957)
proved consistency of the ML estimators and derived their asymptotic distribution under
general conditions. Birch (1964a) proved these results under weaker conditions. For small
samples, significance tests generalize Fisher's (1935a) conditional approach for Fisher's
exact test, and confidence intervals generalize work by Clopper and Pearson (1934) using
small-sample distributions such as the binomial.


16.1  DELTA  METHOD 

Suppose  that  a  statistic  used  to  estimate  a  parameter  has  a  large-sample  normal  distribution. 
In  this  section  we  show  that  many  functions  of  that  statistic  are  also  asymptotically  normal. 

16.1.1  O, o Rates of Convergence

Big O and little o notation is useful for describing limiting behavior of sequences. For real
numbers $\{z_n\}$, the little o notation $o(z_n)$ represents a term that has smaller order than $z_n$ as
$n \to \infty$, in the sense that $o(z_n)/z_n \to 0$ as $n \to \infty$. For instance, $\sqrt{n}$ is $o(n)$ as $n \to \infty$,
since $\sqrt{n}/n \to 0$ as $n \to \infty$. A sequence that is $o(1)$ satisfies $o(1)/1 = o(1) \to 0$; for
instance, $n^{-1/2}$ is $o(1)$ as $n \to \infty$.

The big O notation $O(z_n)$ represents terms that have the same order of magnitude as $z_n$,
in the sense that $|O(z_n)/z_n|$ is bounded as $n \to \infty$. For instance, $(3/n) + (8/n^2)$ is $O(n^{-1})$
as $n \to \infty$; dividing it by $n^{-1}$ gives a ratio that takes value close to 3 for large n.
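
The two deterministic examples above are easy to check numerically. The following is only an illustrative sketch (the function names are ours, not standard notation): it evaluates the ratios whose limits define the $o(n)$ and $O(n^{-1})$ statements.

```python
import math


def little_o_ratio(n):
    """sqrt(n) / n, which tends to 0, so sqrt(n) is o(n)."""
    return math.sqrt(n) / n


def big_o_ratio(n):
    """[(3/n) + (8/n^2)] / n^(-1), which stays bounded and tends to 3."""
    return (3 / n + 8 / n**2) / (1 / n)


for n in (10, 1_000, 100_000):
    print(n, little_o_ratio(n), big_o_ratio(n))
```

The first ratio shrinks toward 0 while the second equals 3 + 8/n and so settles near 3, which is the boundedness that the big O statement requires.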

Similar notation applies to sequences of random variables. This notation uses a subscript
p to indicate that the sequence has probabilistic rather than deterministic behavior. The
symbol $o_p(z_n)$ denotes a random variable of smaller order than $z_n$ for large n, in the sense that
$o_p(z_n)/z_n$ converges in probability to 0; that is, for any fixed $\epsilon > 0$, $P(|o_p(z_n)/z_n| < \epsilon) \to 1$
as $n \to \infty$. The notation $O_p(z_n)$ represents a random variable such that for every $\epsilon > 0$,
there is a constant $K$ and an integer $n_0$ such that $P[|O_p(z_n)/z_n| < K] > 1 - \epsilon$ for all
$n > n_0$.

For the sample mean $\bar{Y}_n$ of n independent observations $Y_1, \ldots, Y_n$ from a distribution
having $E(Y_i) = \mu$, $(\bar{Y}_n - \mu) = o_p(1)$, since $(\bar{Y}_n - \mu)/1$ converges in probability to 0 as
$n \to \infty$ by the law of large numbers. By Tchebychev's inequality, the difference between
a random variable and its expected value has the same order of magnitude as the standard
deviation of that random variable. Since $\bar{Y}_n - \mu$ has standard deviation $\sigma/\sqrt{n}$, $(\bar{Y}_n - \mu) =
O_p(n^{-1/2})$.
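
The step from Tchebychev's inequality to the $O_p(n^{-1/2})$ rate can be written out explicitly. The following is a sketch, assuming only that the observations have finite variance $\sigma^2$; the constant $K$ plays the role of the bound in the definition of $O_p$.

```latex
% For any K > 0, Tchebychev's inequality applied to \bar{Y}_n gives
P\left( |\bar{Y}_n - \mu| \ge K \sigma / \sqrt{n} \right)
  \le \frac{\mathrm{var}(\bar{Y}_n)}{K^2 \sigma^2 / n}
  = \frac{\sigma^2 / n}{K^2 \sigma^2 / n}
  = \frac{1}{K^2}.
% Hence, for a given \epsilon > 0, any K > \epsilon^{-1/2} yields
P\left( \left| \frac{\bar{Y}_n - \mu}{n^{-1/2}} \right| < K \sigma \right)
  \ge 1 - \frac{1}{K^2} > 1 - \epsilon
  \quad \text{for all } n,
% which is the definition of (\bar{Y}_n - \mu) = O_p(n^{-1/2}) with bound K\sigma.
```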

A random variable that is $O_p(n^{-1/2})$ is also $o_p(1)$. An example is $(\bar{Y}_n - \mu)$. Multiplication
affects the order in a natural manner (Exercise 16.5). If the difference between two random
variables is $o_p(1)$ as $n \to \infty$, Slutzky's theorem states that those random variables have the
same limiting distribution.

16.1.2  Delta  Method  for  a  Function  of  a  Random  Variable 

Let $T_n$ denote a statistic, the subscript expressing its dependence on the sample size n. For
large samples, suppose $T_n$ has approximately a normal distribution with mean $\theta$ and standard
error $\sigma/\sqrt{n}$. More precisely, as $n \to \infty$, the cdf of $\sqrt{n}(T_n - \theta)$ converges to a $N(0, \sigma^2)$
cdf. This limiting behavior is an example of convergence in distribution, denoted by

$$\sqrt{n}\,(T_n - \theta) \xrightarrow{d} N(0, \sigma^2). \tag{16.1}$$

For a function g, we now derive the limiting distribution of $g(T_n)$. Suppose that g is at
least twice differentiable at $\theta$. By the Taylor series expansion of $g(t)$ in a neighborhood of
$\theta$, for some $\theta^*$ between $t$ and $\theta$,

$$g(t) = g(\theta) + (t - \theta) g'(\theta) + (t - \theta)^2 g''(\theta^*)/2
       = g(\theta) + (t - \theta) g'(\theta) + O(|t - \theta|^2).$$

Substituting the random variable $T_n$ for $t$,

$$\sqrt{n}\,[g(T_n) - g(\theta)] = \sqrt{n}\,(T_n - \theta) g'(\theta) + \sqrt{n}\, O(|T_n - \theta|^2)
  = \sqrt{n}\,(T_n - \theta) g'(\theta) + O_p(n^{-1/2}), \tag{16.2}$$

since

$$\sqrt{n}\, O(|T_n - \theta|^2) = \sqrt{n}\, O[O_p(n^{-1})] = O_p(n^{-1/2}).$$

The $O_p(n^{-1/2})$ term is asymptotically negligible, so $\sqrt{n}\,[g(T_n) - g(\theta)]$ has the same
limiting distribution as $\sqrt{n}\,(T_n - \theta) g'(\theta)$; that is, $g(T_n) - g(\theta)$ behaves like the constant
multiple $g'(\theta)$ of $(T_n - \theta)$. Now, $(T_n - \theta)$ is approximately normal with variance $\sigma^2/n$. Thus,
$g(T_n) - g(\theta)$ is approximately normal with variance $\sigma^2 [g'(\theta)]^2/n$. More precisely,

$$\sqrt{n}\,[g(T_n) - g(\theta)] \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2). \tag{16.3}$$

Figure  3.1  illustrated  this  result,  and  in  Section  3.1.6  we  applied  it  to  the  sample  logit. 

Result (16.3) is called the delta method for obtaining asymptotic distributions. Since
$\sigma^2 = \sigma^2(\theta)$ and $g'(\theta)$ usually depend on $\theta$, the asymptotic variance is unknown. Let $\sigma^2(T_n)$
and $g'(T_n)$ denote these terms evaluated at the sample estimator $T_n$ of $\theta$. When $g'(\cdot)$ and
$\sigma = \sigma(\cdot)$ are continuous at $\theta$, then $\sigma(T_n) g'(T_n)$ is a consistent estimator of $\sigma(\theta) g'(\theta)$. Thus,
Wald confidence intervals and tests use the result that $\sqrt{n}\,[g(T_n) - g(\theta)] / [\sigma(T_n) |g'(T_n)|]$ is
asymptotically standard normal. For instance,

$$g(T_n) \pm z_{\alpha/2}\, \sigma(T_n) |g'(T_n)| / \sqrt{n}$$

is a large-sample $100(1 - \alpha)\%$ Wald confidence interval for $g(\theta)$.
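
As an illustration of this interval, here is a minimal sketch for the sample logit, the case mentioned above from Section 3.1.6. It assumes a binomial sample with sample proportion strictly between 0 and 1; the function name and the data (40 successes in 100 trials) are hypothetical.

```python
import math


def logit_wald_ci(successes, n, z=1.959964):
    """Delta-method Wald interval for logit(pi) from a binomial sample.

    Here T_n is the sample proportion p, sigma^2(pi) = pi(1 - pi), and
    g(pi) = log[pi / (1 - pi)], so that g'(pi) = 1 / [pi(1 - pi)].
    Requires 0 < successes < n.
    """
    p = successes / n
    estimate = math.log(p / (1 - p))        # g(T_n)
    sigma = math.sqrt(p * (1 - p))          # sigma(T_n)
    derivative = 1 / (p * (1 - p))          # g'(T_n)
    se = sigma * abs(derivative) / math.sqrt(n)
    return estimate - z * se, estimate + z * se


# Hypothetical data: 40 successes in 100 trials; z is the 0.975 normal quantile.
lower, upper = logit_wald_ci(40, 100)
print(lower, upper)
```

The standard error simplifies to $1/\sqrt{n p (1 - p)}$, so the interval is centered at the sample logit with half-width $z_{\alpha/2}/\sqrt{n p (1 - p)}$.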

16.1.3  Delta  Method  for  a  Function  of  a  Random  Vector 

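The delta method generalizes to functions of random vectors. Suppose that $\mathbf{T}_n =
(T_{n1}, \ldots, T_{nN})^T$ is asymptotically multivariate normal with mean $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^T$
and covariance matrix $\boldsymbol{\Sigma}/n$. Suppose that $g(t_1, \ldots, t_N)$ has a nonzero differential $\boldsymbol{\phi} =
(\phi_1, \ldots, \phi_N)^T$ at $\boldsymbol{\theta}$, where

$$\phi_i = \left. \frac{\partial g}{\partial t_i} \right|_{\mathbf{t} = \boldsymbol{\theta}}.$$

Then,

$$\sqrt{n}\,[g(\mathbf{T}_n) - g(\boldsymbol{\theta})] \xrightarrow{d} N(0, \boldsymbol{\phi}^T \boldsymbol{\Sigma} \boldsymbol{\phi}). \tag{16.4}$$

For large n, $g(\mathbf{T}_n)$ has distribution similar to the normal with mean $g(\boldsymbol{\theta})$ and variance
$\boldsymbol{\phi}^T \boldsymbol{\Sigma} \boldsymbol{\phi}/n$.

The proof of (16.4) follows from the expansion

$$g(\mathbf{T}_n) - g(\boldsymbol{\theta}) = (\mathbf{T}_n - \boldsymbol{\theta})^T \boldsymbol{\phi} + o(\lVert \mathbf{T}_n - \boldsymbol{\theta} \rVert),$$

where $\lVert \mathbf{z} \rVert = \sqrt{\sum_i z_i^2}$ denotes the length of vector $\mathbf{z}$. For large n, $g(\mathbf{T}_n) - g(\boldsymbol{\theta})$ behaves
like a linear function of the approximately normal random vector $(\mathbf{T}_n - \boldsymbol{\theta})$. Thus, it itself
is approximately normal.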
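
16.1.4  Asymptotic Normality of Functions of Multinomial Counts

The delta method for random vectors implies asymptotic normality of many functions of
multinomial cell counts $\mathbf{n} = (n_1, \ldots, n_N)^T$ in contingency tables with cell probabilities
$\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)^T$. Let $\mathbf{p} = (p_1, \ldots, p_N)^T$ denote the sample proportions, where $p_i = n_i/n$
with $n = n_1 + \cdots + n_N$. Denote observation i by $\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{iN})$, where $Y_{ij} = 1$ if it
falls in cell j and $Y_{ij} = 0$ otherwise, $i = 1, \ldots, n$. Since each observation falls in only one
cell, $\sum_j Y_{ij} = 1$ and $Y_{ij} Y_{ik} = 0$ when $j \ne k$. Also, $p_j = \sum_i Y_{ij}/n$, and

$$E(Y_{ij}) = P(Y_{ij} = 1) = \pi_j = E(Y_{ij}^2), \qquad E(Y_{ij} Y_{ik}) = 0 \ \text{ if } j \ne k.$$

It follows that

$$E(\mathbf{Y}_i) = \boldsymbol{\pi} \quad \text{and} \quad \mathrm{cov}(\mathbf{Y}_i) = \boldsymbol{\Sigma}, \quad i = 1, \ldots, n,$$

where $\boldsymbol{\Sigma} = (\sigma_{jk})$ with

$$\sigma_{jj} = \mathrm{var}(Y_{ij}) = E(Y_{ij}^2) - [E(Y_{ij})]^2 = \pi_j (1 - \pi_j),$$

$$\sigma_{jk} = \mathrm{cov}(Y_{ij}, Y_{ik}) = E(Y_{ij} Y_{ik}) - E(Y_{ij}) E(Y_{ik}) = -\pi_j \pi_k \quad \text{for } j \ne k.$$

The matrix $\boldsymbol{\Sigma}$ has form

$$\boldsymbol{\Sigma} = \mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T,$$

where $\mathrm{Diag}(\boldsymbol{\pi})$ is the diagonal matrix with the elements of $\boldsymbol{\pi}$ on the main diagonal.

Since $\mathbf{p} = (\sum_i \mathbf{Y}_i)/n$ is a sample mean of n independent observations,

$$\mathrm{cov}(\mathbf{p}) = [\mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T]/n. \tag{16.5}$$

This covariance matrix is singular, because of the linear dependence $\sum_j p_j = 1$. The
multivariate central limit theorem (Rao 1973, p. 128) implies

$$\sqrt{n}\,(\mathbf{p} - \boldsymbol{\pi}) \xrightarrow{d} N[\mathbf{0}, \mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T]. \tag{16.6}$$

By the delta method, functions of $\mathbf{p}$ having nonzero differential at $\boldsymbol{\pi}$ are also
asymptotically normal. Let $g(t_1, \ldots, t_N)$ be a differentiable function, and let $\phi_i = \partial g/\partial \pi_i$ denote
$\partial g/\partial t_i$ evaluated at $\mathbf{t} = \boldsymbol{\pi}$. By the delta method (16.4),

$$\sqrt{n}\,[g(\mathbf{p}) - g(\boldsymbol{\pi})] \xrightarrow{d} N(0, \boldsymbol{\phi}^T [\mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T] \boldsymbol{\phi}). \tag{16.7}$$

The asymptotic variance equals

$$\boldsymbol{\phi}^T \mathrm{Diag}(\boldsymbol{\pi}) \boldsymbol{\phi} - (\boldsymbol{\phi}^T \boldsymbol{\pi})^2 = \sum_i \pi_i \phi_i^2 - \Big( \sum_i \pi_i \phi_i \Big)^2.$$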

In  Section  3.1.7  we  used  this  formula  to  derive  the  large-sample  variance  of  the  sample  log 
odds  ratio. 
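
The log odds ratio case can be sketched numerically as follows. The cell probabilities are hypothetical, and the function is simply a direct transcription of the asymptotic variance in (16.7), namely the sum of $\pi_i \phi_i^2$ minus the square of the sum of $\pi_i \phi_i$.

```python
def asymptotic_variance(pi, phi):
    """sum_i pi_i phi_i^2 - (sum_i pi_i phi_i)^2, the variance in (16.7)."""
    first = sum(p * f * f for p, f in zip(pi, phi))
    second = sum(p * f for p, f in zip(pi, phi)) ** 2
    return first - second


# Hypothetical 2 x 2 table probabilities, ordered (pi_11, pi_12, pi_21, pi_22).
pi = [0.4, 0.1, 0.2, 0.3]

# Partial derivatives of g(pi) = log[pi_11 pi_22 / (pi_12 pi_21)].
phi = [1 / pi[0], -1 / pi[1], -1 / pi[2], 1 / pi[3]]

log_or_variance = asymptotic_variance(pi, phi)
print(log_or_variance, sum(1 / p for p in pi))
```

Because each $\pi_i \phi_i$ equals +1 or -1 and the four terms cancel, the second term vanishes and the variance reduces to the sum of the reciprocal cell probabilities.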
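
16.1.5  Delta Method for a Vector Function of a Random Vector

The delta method generalizes further to a vector of functions of an asymptotically normal
random vector. Let $\mathbf{g}(\mathbf{t}) = (g_1(\mathbf{t}), \ldots, g_q(\mathbf{t}))^T$ and let $(\partial \mathbf{g}/\partial \boldsymbol{\theta})$ denote the $q \times N$ Jacobian
matrix for which the entry in row i and column j is $\partial g_i(\mathbf{t})/\partial t_j$ evaluated at $\mathbf{t} = \boldsymbol{\theta}$. Then,

$$\sqrt{n}\,[\mathbf{g}(\mathbf{T}_n) - \mathbf{g}(\boldsymbol{\theta})] \xrightarrow{d} N[\mathbf{0}, (\partial \mathbf{g}/\partial \boldsymbol{\theta}) \boldsymbol{\Sigma} (\partial \mathbf{g}/\partial \boldsymbol{\theta})^T]. \tag{16.8}$$

The rank of the limiting normal distribution equals the rank of the asymptotic covariance
matrix.

This expression is useful for finding large-sample joint distributions. For instance, from
(16.6), (16.7), and (16.8), the asymptotic joint distribution of several functions of
multinomial proportions has covariance matrix of the form

$$\text{asymp. cov}\{\sqrt{n}\,[\mathbf{g}(\mathbf{p}) - \mathbf{g}(\boldsymbol{\pi})]\} = \boldsymbol{\Phi} [\mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T] \boldsymbol{\Phi}^T,$$

where $\boldsymbol{\Phi}$ is the Jacobian $(\partial \mathbf{g}/\partial \boldsymbol{\pi})$.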


16.1.6  Joint  Asymptotic  Normality  of  Log  Odds  Ratios 

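We illustrate formula (16.8) by finding the asymptotic joint distribution of a set of log odds
ratios in a contingency table. Let $\mathbf{g}(\boldsymbol{\pi}) = \log(\boldsymbol{\pi})$ denote the vector of natural logs of cell
probabilities, for which

$$\partial \mathbf{g}/\partial \boldsymbol{\pi} = \mathrm{Diag}(\boldsymbol{\pi})^{-1}.$$

The covariance of the asymptotic distribution of $\sqrt{n}\,[\log(\mathbf{p}) - \log(\boldsymbol{\pi})]$ is

$$\mathrm{Diag}(\boldsymbol{\pi})^{-1} [\mathrm{Diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T] \mathrm{Diag}(\boldsymbol{\pi})^{-1} = \mathrm{Diag}(\boldsymbol{\pi})^{-1} - \mathbf{1} \mathbf{1}^T,$$

where $\mathbf{1}$ is an $N \times 1$ vector of 1 elements.

For a $q \times N$ matrix of constants $\mathbf{C}$, it follows that

$$\sqrt{n}\, \mathbf{C} [\log(\mathbf{p}) - \log(\boldsymbol{\pi})] \xrightarrow{d} N[\mathbf{0}, \mathbf{C}\, \mathrm{Diag}(\boldsymbol{\pi})^{-1} \mathbf{C}^T - \mathbf{C} \mathbf{1} \mathbf{1}^T \mathbf{C}^T]. \tag{16.9}$$

Now, suppose $\mathbf{C} \log(\mathbf{p})$ is a set of sample log odds ratios. Then, each row of $\mathbf{C}$ contains
zeros except for two +1 elements and two -1 elements in the positions multiplied by
the relevant elements of $\log(\mathbf{p})$ to form the given log odds ratio. The second term in the
covariance matrix in (16.9) is then zero. If a particular odds ratio uses the cells numbered
h, i, j, and k, the variance of the asymptotic distribution is

$$\text{asymp. var}[\sqrt{n}\,(\text{sample log odds ratio})] = \pi_h^{-1} + \pi_i^{-1} + \pi_j^{-1} + \pi_k^{-1}.$$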

When  two  log  odds  ratios  have  no  cells  in  common,  their  asymptotic  covariance  in  the 
limiting  normal  distribution  equals  zero. 
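
To see the covariance matrix in (16.9) at work, the following sketch evaluates it for two log odds ratios that do share cells. The 2 x 3 table of probabilities, the cell ordering, and the function name are hypothetical choices made for illustration.

```python
def log_odds_ratio_cov(pi, C):
    """C Diag(pi)^{-1} C^T - C 1 1^T C^T from (16.9), as nested lists."""
    q = len(C)
    row_sums = [sum(row) for row in C]          # the vector C 1
    cov = [[0.0] * q for _ in range(q)]
    for r in range(q):
        for s in range(q):
            first = sum(C[r][i] * C[s][i] / pi[i] for i in range(len(pi)))
            cov[r][s] = first - row_sums[r] * row_sums[s]
    return cov


# Hypothetical 2 x 3 table, cells ordered (11, 12, 13, 21, 22, 23).
pi = [0.10, 0.20, 0.15, 0.25, 0.10, 0.20]
C = [
    [1, -1, 0, -1, 1, 0],   # log odds ratio using columns 1 and 2
    [0, 1, -1, 0, -1, 1],   # log odds ratio using columns 2 and 3
]
cov = log_odds_ratio_cov(pi, C)
for row in cov:
    print(row)
```

Each diagonal entry is the sum of the four reciprocal probabilities for that odds ratio. The off-diagonal entry here is nonzero, equal to minus the sum of the reciprocal probabilities of the two shared cells (12 and 22), in contrast to the zero covariance when no cells are shared.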




16.2  ASYMPTOTIC  DISTRIBUTIONS  OF  ESTIMATORS  OF  MODEL 
PARAMETERS  AND  CELL  PROBABILITIES 

We  now  derive  basic  results  of  large-sample  model-based  inference  for  contingency  tables. 
The  delta  method  is  the  key  tool.  The  derivations  apply  to  a  single  multinomial  distribution, 
but  extend  directly  to  products  of  multinomials  for  independent  samples. 

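The observations are counts $\mathbf{n} = (n_1, \ldots, n_N)^T$ in N cells of a contingency table. The
asymptotics regard N as fixed and let $n = \sum_i n_i \to \infty$. We assume that $\mathbf{n} = n\mathbf{p}$ has a
multinomial distribution with probabilities $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)^T$. In general terms, the model is

$$\boldsymbol{\pi} = \boldsymbol{\pi}(\boldsymbol{\theta}),$$

where $\boldsymbol{\pi}(\boldsymbol{\theta})$ denotes a function that relates $\boldsymbol{\pi}$ to a smaller number of parameters
$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_q)^T$. We use $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$ to denote generic parameter and probability values,
and $\boldsymbol{\theta}_0 = (\theta_{10}, \ldots, \theta_{q0})^T$ and $\boldsymbol{\pi}_0 = (\pi_{10}, \ldots, \pi_{N0})^T = \boldsymbol{\pi}(\boldsymbol{\theta}_0)$ to denote true values for a
particular application. When the model does not hold, no $\boldsymbol{\theta}_0$ exists for which $\boldsymbol{\pi}(\boldsymbol{\theta}_0) = \boldsymbol{\pi}_0$;
that is, $\boldsymbol{\pi}_0$ falls outside the subset of $\boldsymbol{\pi}$ values that is the range of $\boldsymbol{\pi}(\boldsymbol{\theta})$ for the space of
possible $\boldsymbol{\theta}$. We consider this case in Section 16.3.5.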

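We first derive the asymptotic distribution of the ML estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$. We use that to
derive the asymptotic distribution of the model-based ML estimator $\hat{\boldsymbol{\pi}} = \boldsymbol{\pi}(\hat{\boldsymbol{\theta}})$ of $\boldsymbol{\pi}$. The
assumed regularity conditions are:

1.  $\boldsymbol{\theta}_0$ is not on the boundary of the parameter space.

2.  All $\pi_{i0} > 0$.

3.  $\boldsymbol{\pi}(\boldsymbol{\theta})$ has continuous first-order partial derivatives in a neighborhood of $\boldsymbol{\theta}_0$.

4.  The Jacobian matrix $(\partial \boldsymbol{\pi}/\partial \boldsymbol{\theta})$ has full rank q at $\boldsymbol{\theta}_0$.

These conditions ensure that $\boldsymbol{\pi}(\boldsymbol{\theta})$ is locally smooth and one-to-one at $\boldsymbol{\theta}_0$ and Taylor series
expansions exist in neighborhoods around $\boldsymbol{\theta}_0$ and $\boldsymbol{\pi}_0$.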

As in the Cramér (1946) and Rao (1957) proofs, the derivations regard the ML estimate
as a point in the parameter space where the derivative of the log-likelihood function is zero.
Birch (1964a) regarded it as a point at which the likelihood takes value arbitrarily near
its supremum. Although his approach is more powerful, the proofs are more complex. In
assuming that an ML estimator of $\boldsymbol{\theta}$ exists and is a solution of the likelihood equations,
we require a strong identifiability condition: For every $\epsilon > 0$, there exists a $\delta > 0$ such that
if $\lVert \boldsymbol{\theta} - \boldsymbol{\theta}_0 \rVert > \epsilon$, then $\lVert \boldsymbol{\pi}(\boldsymbol{\theta}) - \boldsymbol{\pi}_0 \rVert > \delta$. This condition implies a weaker one that two $\boldsymbol{\theta}$
values cannot have the same $\boldsymbol{\pi}$ value. When strong identifiability and the other regularity
conditions hold, the probability that an ML estimator is a root of the likelihood equations
converges to 1 as $n \to \infty$. That estimator then has the standard asymptotic properties of a
solution of the likelihood equations. For proofs with slightly weaker regularity conditions,
see Rao (1973, Sec. 5e) and Bishop et al. (1975, Secs. 14.7 and 14.8).

16.2.1  Asymptotic  Distribution  of  Model  Parameter  Estimator 

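Suppose that observations are independent from $f(y; \theta)$, some probability mass function.
The ML estimator $\hat{\theta}$ is efficient, in the sense that

$$\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(0, J^{-1}),$$

where $J$ is the information matrix for a single observation. The $(j, k)$ element of $J$ is

$$-E\left( \frac{\partial^2 \log f(y; \theta)}{\partial\theta_j \, \partial\theta_k} \right) = E\left[ \frac{\partial \log f(y; \theta)}{\partial\theta_j} \, \frac{\partial \log f(y; \theta)}{\partial\theta_k} \right].$$

When $f$ is the probability of an observation having multinomial probabilities
$(\pi_1(\theta), \ldots, \pi_N(\theta))$, this element of $J$ equals

$$\sum_{i=1}^{N} \pi_i(\theta) \, \frac{\partial \log \pi_i(\theta)}{\partial\theta_j} \, \frac{\partial \log \pi_i(\theta)}{\partial\theta_k} = \sum_{i=1}^{N} \frac{\partial \pi_i(\theta)}{\partial\theta_j} \, \frac{\partial \pi_i(\theta)}{\partial\theta_k} \, \frac{1}{\pi_i(\theta)}.$$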

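We'll express these elements more simply in matrix form. Let $A$ denote the $N \times q$ matrix
having elements

$$a_{ij} = \pi_{i0}^{-1/2} \, \frac{\partial \pi_i(\theta)}{\partial \theta_{j0}}.$$

The matrix expression for $A$ is

$$A = \mathrm{Diag}(\pi_0)^{-1/2} (\partial\pi/\partial\theta_0), \tag{16.10}$$

where $(\partial\pi/\partial\theta_0)$ denotes the Jacobian $(\partial\pi/\partial\theta)$ evaluated at $\theta_0$. So, the above element of $J$
equals the $(j, k)$ element of $A^T A$. Since the Jacobian has full rank at $\theta_0$, $A^T A$ is nonsingular.
Thus,

$$\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, (A^T A)^{-1}]. \tag{16.11}$$

The asymptotic covariance matrix of $\hat{\theta}$ depends on $(\partial\pi/\partial\theta_0)$ and hence on the function for
modeling $\pi$ in terms of $\theta$.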
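
The quantities in (16.10) and (16.11) are easy to evaluate numerically for a small model. The following Python sketch is only an illustration: the one-parameter trinomial model $\pi(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$, the value $\theta_0 = 0.3$, and all function names are assumptions introduced here rather than part of the text. For this model $A^T A$ is a scalar, and algebra gives $(A^T A)^{-1} = \theta_0(1-\theta_0)/2$, which the code reproduces.

```python
import math

def pi_model(theta):
    """Hypothetical one-parameter trinomial model (N = 3, q = 1)."""
    return [theta**2, 2 * theta * (1 - theta), (1 - theta)**2]

def jacobian(theta):
    """Derivatives d pi_i / d theta for the model above."""
    return [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]

def a_matrix(theta0):
    """A = Diag(pi0)^(-1/2) (d pi / d theta0); a single column because q = 1."""
    pi0 = pi_model(theta0)
    return [d / math.sqrt(p) for d, p in zip(jacobian(theta0), pi0)]

theta0 = 0.3                       # assumed true parameter value
A = a_matrix(theta0)
info = sum(a * a for a in A)       # A^T A, the information for one observation
asymp_var = 1 / info               # (A^T A)^{-1}, asymptotic variance of sqrt(n) * theta-hat
print(info, asymp_var)
```

With a vector-valued $\theta$ the same steps apply, except that $A^T A$ becomes a $q \times q$ matrix that must be inverted.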

16.2.2  Asymptotic  Distribution  of  Cell  Probability  Estimators 

The asymptotic distribution of the model-based estimator $\hat{\pi}$ follows from the Taylor series
expansion

$$\hat{\pi} = \pi(\hat{\theta}) = \pi(\theta_0) + \frac{\partial\pi}{\partial\theta_0} (\hat{\theta} - \theta_0) + o_p(n^{-1/2}). \tag{16.12}$$

The size of the remainder term follows from $(\hat{\theta} - \theta_0) = O_p(n^{-1/2})$.

Now $\pi(\theta_0) = \pi_0$, and $\sqrt{n}\,(\hat{\theta} - \theta_0)$ is asymptotically normal with asymptotic covariance
$(A^T A)^{-1}$. By the delta method,

$$\sqrt{n}\,(\hat{\pi} - \pi_0) \xrightarrow{d} N\left[ 0, \; \frac{\partial\pi}{\partial\theta_0} (A^T A)^{-1} \frac{\partial\pi^T}{\partial\theta_0} \right]. \tag{16.13}$$

The marginal approximation for a particular probability that is very close to 0 may require
quite large $n$ to be good.

16.2.3  Model  Smoothing  Is  Beneficial

When the model holds with $\theta$ having $q < N - 1$ elements, $\hat{\pi} = \pi(\hat{\theta})$ is more efficient
than the sample proportion $p$ for estimating $\pi$. More generally, for estimating a smooth
function $g(\pi)$ of $\pi$, $g(\hat{\pi})$ has smaller asymptotic variance than $g(p)$. Altham (1984) proved
this result. Her proof applies not only to categorical data but to any situation in which a
model describes the dependence of a set of parameters on some smaller set. The proof uses
standard properties of ML estimators and applies whenever regularity conditions hold that
guarantee those properties.

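Let $\Sigma = \mathrm{Diag}(\pi) - \pi\pi^T$ denote the covariance matrix of $\sqrt{n}\,p$. By the delta method,

$$\text{asymp. var}[\sqrt{n}\, g(p)] = \left( \frac{\partial g}{\partial\pi_0} \right)^T [\mathrm{cov}(\sqrt{n}\,p)] \left( \frac{\partial g}{\partial\pi_0} \right) = \left( \frac{\partial g}{\partial\pi_0} \right)^T \Sigma \left( \frac{\partial g}{\partial\pi_0} \right)$$

and

$$\text{asymp. var}[\sqrt{n}\, g(\hat{\pi})] = \left( \frac{\partial g}{\partial\pi_0} \right)^T [\text{asymp. cov}(\sqrt{n}\,\hat{\pi})] \left( \frac{\partial g}{\partial\pi_0} \right) = \left( \frac{\partial g}{\partial\pi_0} \right)^T \frac{\partial\pi}{\partial\theta_0} [\text{asymp. cov}(\sqrt{n}\,\hat{\theta})] \frac{\partial\pi^T}{\partial\theta_0} \left( \frac{\partial g}{\partial\pi_0} \right).$$

From (16.10) and (16.11),

$$\text{asymp. cov}(\sqrt{n}\,\hat{\theta}) = (A^T A)^{-1} = [(\partial\pi/\partial\theta_0)^T \, \mathrm{Diag}(\pi_0)^{-1} (\partial\pi/\partial\theta_0)]^{-1}.$$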
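
The efficiency gain from model smoothing can be seen numerically for the coordinate functions $g(\pi) = \pi_i$. The sketch below is a hedged illustration: it assumes the same hypothetical trinomial model $\pi(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$ with $\theta_0 = 0.3$ (an example chosen here, not taken from the text) and compares the diagonal of $(\partial\pi/\partial\theta_0)(A^T A)^{-1}(\partial\pi/\partial\theta_0)^T$ with the multinomial variances $\pi_{i0}(1 - \pi_{i0})$ of $\sqrt{n}\,p_i$.

```python
theta0 = 0.3                                                  # assumed true value
pi0 = [theta0**2, 2 * theta0 * (1 - theta0), (1 - theta0)**2]  # model probabilities
jac = [2 * theta0, 2 - 4 * theta0, -2 * (1 - theta0)]          # d pi_i / d theta at theta0

# A^T A = (d pi / d theta0)^T Diag(pi0)^{-1} (d pi / d theta0); a scalar because q = 1
info = sum(d * d / p for d, p in zip(jac, pi0))

# Diagonal of (d pi / d theta0) (A^T A)^{-1} (d pi / d theta0)^T:
# asymptotic variances of sqrt(n) * pi-hat_i under the model
var_model = [d * d / info for d in jac]

# Diagonal of Diag(pi0) - pi0 pi0^T: variances of sqrt(n) * p_i
var_sample = [p * (1 - p) for p in pi0]

for vm, vs in zip(var_model, var_sample):
    print(round(vm, 4), round(vs, 4), round(vm / vs, 3))
```

In this example every model-based variance is well below the corresponding sample-proportion variance, in line with the general result.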


16.3  ASYMPTOTIC  DISTRIBUTIONS  OF  RESIDUALS  AND 
GOODNESS-OF-FIT  STATISTICS 

We next derive the distribution of the Pearson $X^2$ and likelihood-ratio $G^2$ goodness-of-fit
statistics for a multinomial model $\pi = \pi(\theta)$. We first derive the asymptotic joint distribution
of the sample proportions $p$ and model-based estimator $\hat{\pi}$. This distribution determines
large-sample distributions of statistics that depend on both $p$ and $\hat{\pi}$, such as residuals.
Deriving the large-sample chi-squared distribution for $X^2$, which is the sum of squared
Pearson residuals, is then straightforward. We also show that $X^2$ and $G^2$ are asymptotically
equivalent, when the model holds. The presentation borrows from Bishop et al. (1975,
Chap. 14), Cox (1984), Cramér (1946, pp. 432-433), and Rao (1973, Sec. 6b).


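16.3.1  Joint  Asymptotic  Normality  of  $p$  and  $\hat{\pi}$

We first express the joint dependence of $p$ and $\hat{\pi}$ on $p$, in order to show the joint asymptotic
normality of $p$ and $\hat{\pi}$. Let

$$D = \mathrm{Diag}(\pi_0)^{1/2} A (A^T A)^{-1} A^T \mathrm{Diag}(\pi_0)^{-1/2}.$$

From (16.11) and (16.12),

$$\hat{\pi} - \pi_0 = \frac{\partial\pi}{\partial\theta_0} (\hat{\theta} - \theta_0) + o_p(n^{-1/2}) = D(p - \pi_0) + o_p(n^{-1/2}).$$

Therefore,

$$\sqrt{n} \begin{pmatrix} p - \pi_0 \\ \hat{\pi} - \pi_0 \end{pmatrix} = \begin{pmatrix} I \\ D \end{pmatrix} \sqrt{n}\,(p - \pi_0) + o_p(1),$$

where $I$ is an $N \times N$ identity matrix. By the delta method,

$$\sqrt{n} \begin{pmatrix} p - \pi_0 \\ \hat{\pi} - \pi_0 \end{pmatrix} \xrightarrow{d} N(0, \Sigma^*), \tag{16.14}$$

where

$$\Sigma^* = \begin{pmatrix} \mathrm{Diag}(\pi_0) - \pi_0\pi_0^T & [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] D^T \\ D[\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] & D[\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] D^T \end{pmatrix}.$$

The two matrix blocks on the main diagonal of $\Sigma^*$ are $\mathrm{cov}(\sqrt{n}\,p)$ and asymp. $\mathrm{cov}(\sqrt{n}\,\hat{\pi})$, derived
previously. The new information here is that asymp. $\mathrm{cov}(\sqrt{n}\,p, \sqrt{n}\,\hat{\pi}) = [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] D^T$.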


16.3.2  Asymptotic  Distribution  of  Pearson  and  Standardized  Residuals 

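For cell counts $\{n_i\}$ the Pearson statistic is $X^2 = \sum_i e_i^2$, where

$$e_i = \frac{n_i - \hat{\mu}_i}{\hat{\mu}_i^{1/2}} = \frac{\sqrt{n}\,(p_i - \hat{\pi}_i)}{\hat{\pi}_i^{1/2}}.$$

For Poisson models, this is the Pearson residual. The residuals $e$ are functions of $p$ and
$\hat{\pi}$, which are jointly asymptotically normal from (16.14). To use the delta method, we
calculate

$$\partial e_i/\partial p_i = \sqrt{n}\,\hat{\pi}_i^{-1/2}, \qquad \partial e_i/\partial\hat{\pi}_i = -\sqrt{n}\,(p_i + \hat{\pi}_i)/(2\hat{\pi}_i^{3/2}),$$

$$\partial e_i/\partial p_j = \partial e_i/\partial\hat{\pi}_j = 0 \quad \text{for } i \neq j.$$

That is,

$$\frac{\partial e}{\partial p} = \sqrt{n}\, \mathrm{Diag}(\hat{\pi})^{-1/2} \quad \text{and} \quad \frac{\partial e}{\partial\hat{\pi}} = -\tfrac{1}{2} \sqrt{n}\, [\mathrm{Diag}(p) + \mathrm{Diag}(\hat{\pi})] \, \mathrm{Diag}(\hat{\pi})^{-3/2}. \tag{16.16}$$

Evaluated at $p = \pi_0$ and $\hat{\pi} = \pi_0$, these matrices equal $\sqrt{n}\,\mathrm{Diag}(\pi_0)^{-1/2}$ and
$-\sqrt{n}\,\mathrm{Diag}(\pi_0)^{-1/2}$. Using (16.14), (16.16), and $A^T \pi_0^{1/2} = 0$ [which follows from
$\sum_i \partial\pi_i(\theta)/\partial\theta_j = \partial/\partial\theta_j \, [\sum_i \pi_i(\theta)] = \partial/\partial\theta_j \, (1) = 0$],

$$e \xrightarrow{d} N\left( 0, \; I - \pi_0^{1/2} (\pi_0^{1/2})^T - A (A^T A)^{-1} A^T \right). \tag{16.17}$$

The limiting distribution has form $N(0, I - \text{Hat})$, where Hat is the hat matrix (Section 4.5.6).
The standardized residual (Haberman 1973a) divides $e$ by its estimated standard
error. This statistic, which is asymptotically standard normal, equals

$$r_i = \frac{e_i}{\left[ 1 - \hat{\pi}_i - \sum_j \sum_k (1/\hat{\pi}_i)(\partial\pi_i/\partial\hat{\theta}_j)(\partial\pi_i/\partial\hat{\theta}_k) \hat{v}^{jk} \right]^{1/2}}, \tag{16.18}$$

where $\hat{v}^{jk}$ denotes the element in row $j$ and column $k$ of $(\hat{A}^T \hat{A})^{-1}$. The denominator of $r_i$ is
$\sqrt{1 - \hat{h}_i}$, where the leverage $\hat{h}_i$ for observation $i$ estimates the $i$th diagonal element of the
hat matrix.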


16.3.3  Asymptotic  Distribution  of  Pearson  $X^2$  Statistic

The proof that the Pearson $X^2$ statistic has an asymptotic chi-squared distribution uses the
following relationship between normal and chi-squared distributions (Rao 1973, p. 188):

Let $X$ be multivariate normal with mean $\nu$ and covariance matrix $B$. A necessary and
sufficient condition for $(X - \nu)^T C (X - \nu)$ to have a chi-squared distribution is $BCBCB = BCB$.
The degrees of freedom equal the rank of $CB$.

When $B$ is nonsingular, the condition simplifies to $CBC = C$.

The Pearson statistic relates to $e$ by $X^2 = e^T e$, so we apply this result by identifying
$X$ with $e$, $\nu = 0$, $C = I$, and $B = I - \pi_0^{1/2} (\pi_0^{1/2})^T - A (A^T A)^{-1} A^T$. Since $C = I$, the
condition for $(X - \nu)^T C (X - \nu) = e^T e = X^2$ to have a chi-squared distribution simplifies
to $BBB = BB$. A direct computation using $A^T \pi_0^{1/2} = 0$ shows that $B$ is idempotent, so
the condition holds. Since $e$ is asymptotically multivariate normal, $X^2$ is asymptotically
chi-squared.

For symmetric idempotent matrices, the rank equals the trace. The trace of $I$ is $N$; the
trace of $\pi_0^{1/2} (\pi_0^{1/2})^T$ equals the trace of $(\pi_0^{1/2})^T \pi_0^{1/2} = \sum_i \pi_{i0}$, which is 1; the trace of
$A (A^T A)^{-1} A^T$ equals the trace of $(A^T A)^{-1} (A^T A)$, the identity matrix of size $q \times q$, which
is $q$. Thus, the rank of $B = CB$ is $N - q - 1$, and the asymptotic chi-squared distribution
has df $= N - q - 1$.
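
The idempotency of $B$ and its trace can be verified numerically. The following sketch is an assumption-laden illustration, not part of the derivation: it uses a hypothetical trinomial model $\pi(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$ with $\theta_0 = 0.3$, so that $N = 3$, $q = 1$, and the trace should equal $N - q - 1 = 1$.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

theta0 = 0.3                                                   # assumed true value
pi0 = [theta0**2, 2 * theta0 * (1 - theta0), (1 - theta0)**2]
jac = [2 * theta0, 2 - 4 * theta0, -2 * (1 - theta0)]           # d pi_i / d theta

root = [math.sqrt(p) for p in pi0]                              # pi0^{1/2}
a_col = [d / r for d, r in zip(jac, root)]                      # the single column of A
ata = sum(a * a for a in a_col)                                 # A^T A (scalar, q = 1)

N, q = 3, 1
# B = I - pi0^{1/2} (pi0^{1/2})^T - A (A^T A)^{-1} A^T
B = [[(1.0 if i == j else 0.0) - root[i] * root[j] - a_col[i] * a_col[j] / ata
      for j in range(N)] for i in range(N)]

BB = matmul(B, B)
max_dev = max(abs(BB[i][j] - B[i][j]) for i in range(N) for j in range(N))
trace = sum(B[i][i] for i in range(N))
print(max_dev, trace)   # deviation from idempotency, and the trace N - q - 1
```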

The result, due to Fisher (1922), is remarkably simple. When the sample size is large,
the distribution of $X^2$ does not depend on $\pi_0$ or the model form. It depends only on the
difference between the dimension of $\pi$, which is $N - 1$, and the dimension of $\theta$. Watson
(1959) showed that the same result holds for the asymptotic conditional distribution, given a
sufficient statistic for nuisance parameters. With $q = 0$ parameters, $X^2$ is Pearson's (1900)
statistic (1.16) for testing that multinomial probabilities equal certain specified values.
Then, df $= N - 1$, as Pearson claimed.

16.3.4  Asymptotic  Distribution  of  Likelihood-Ratio  Statistic

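When the model holds, the likelihood-ratio statistic $G^2$ is asymptotically equivalent to $X^2$
as $n \to \infty$. To show this, we express

$$G^2 = 2 \sum_i n_i \log \frac{n_i}{\hat{\mu}_i} = 2n \sum_i p_i \log\left( 1 + \frac{p_i - \hat{\pi}_i}{\hat{\pi}_i} \right)$$

and apply the expansion

$$\log(1 + x) = x - x^2/2 + x^3/3 - \cdots \quad \text{for } |x| < 1.$$

We identify $x$ with $(p_i - \hat{\pi}_i)/\hat{\pi}_i$, which converges in probability to 0 when the model holds.
For large $n$,

$$G^2 = 2n \sum_i [\hat{\pi}_i + (p_i - \hat{\pi}_i)] \left[ \frac{p_i - \hat{\pi}_i}{\hat{\pi}_i} - \left( \frac{1}{2} \right) \frac{(p_i - \hat{\pi}_i)^2}{\hat{\pi}_i^2} + \cdots \right]$$

$$= 2n \sum_i \left[ (p_i - \hat{\pi}_i) - \left( \frac{1}{2} \right) \frac{(p_i - \hat{\pi}_i)^2}{\hat{\pi}_i} + \frac{(p_i - \hat{\pi}_i)^2}{\hat{\pi}_i} + O_p\big( (p_i - \hat{\pi}_i)^3 \big) \right]$$

$$= n \sum_i \frac{(p_i - \hat{\pi}_i)^2}{\hat{\pi}_i} + 2n\, O_p(n^{-3/2}) = X^2 + O_p(n^{-1/2}) = X^2 + o_p(1),$$

since $\sum_i (p_i - \hat{\pi}_i) = 0$ and $(p_i - \hat{\pi}_i) = (p_i - \pi_i) - (\hat{\pi}_i - \pi_i)$, both of which are
$O_p(n^{-1/2})$. Thus, when the model holds, $X^2 - G^2 \xrightarrow{p} 0$. As a consequence, $G^2$, like $X^2$,
has an asymptotic chi-squared distribution with df $= N - q - 1$.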
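
A small numerical comparison shows how close $X^2$ and $G^2$ are when a model fits well. The sketch below is illustrative only: the counts are invented, and the fitted model is a hypothetical one-parameter trinomial model $\pi(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$, whose ML estimate has the closed form $\hat{\theta} = (2n_1 + n_2)/(2n)$. Neither the data nor the model comes from the text.

```python
import math

counts = [298, 489, 213]                 # made-up cell counts, n = 1000
n = sum(counts)

# ML estimate for the assumed model pi(theta) = (theta^2, 2 theta (1 - theta), (1 - theta)^2)
theta_hat = (2 * counts[0] + counts[1]) / (2 * n)
pi_hat = [theta_hat**2, 2 * theta_hat * (1 - theta_hat), (1 - theta_hat)**2]
mu_hat = [n * p for p in pi_hat]         # fitted counts

# Pearson and likelihood-ratio statistics, each with df = N - q - 1 = 1 here
x2 = sum((c - m) ** 2 / m for c, m in zip(counts, mu_hat))
g2 = 2 * sum(c * math.log(c / m) for c, m in zip(counts, mu_hat))
print(round(x2, 4), round(g2, 4), abs(x2 - g2))
```

The two statistics agree to several decimal places in this example, as the expansion suggests they should when $p_i - \hat{\pi}_i$ is small.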

The parameter value that maximizes the likelihood is the one that minimizes $G^2$. To
show this, we let

$$G^2(\pi; p) = 2n \sum_i p_i \log(p_i/\pi_i).$$

The kernel of the multinomial log-likelihood is

$$L(\theta) = n \sum_i p_i \log \pi_i(\theta) = -n \sum_i p_i \log \frac{p_i}{\pi_i(\theta)} + n \sum_i p_i \log p_i = -\left( \tfrac{1}{2} \right) G^2(\pi(\theta); p) + n \sum_i p_i \log p_i.$$

The second term in the last expression does not depend on $\theta$, so maximizing $L(\theta)$ is
equivalent to minimizing $G^2$ with respect to $\theta$.

A fundamental result for $G^2$ concerns comparisons of nested models. For two models,
with $M_0$ a special case of $M_1$, let $q_0$ and $q_1$ denote the numbers of parameters and let $\{\hat{\pi}_{0i}\}$
and $\{\hat{\pi}_{1i}\}$ denote ML estimators of cell probabilities. Then

$$G^2(M_0) - G^2(M_1) = 2n \sum_i p_i \log(\hat{\pi}_{1i}/\hat{\pi}_{0i})$$

has the form of $-2$(log-likelihood ratio) for testing that $M_0$ holds against the alternative
that $M_1$ holds. Theory for likelihood-ratio tests suggests that when the simpler model holds,
its asymptotic distribution is chi-squared with df $= q_1 - q_0$. For details, see Bishop et al.
(1975, pp. 525-526), Haberman (1974a, p. 108), and Rao (1973, pp. 418-419). The statistic
$X^2(M_0 | M_1)$ in (4.40) with $v(\hat{\mu}_{0i}) = \hat{\mu}_{0i}$ is a quadratic approximation for the $G^2$ difference.
Haberman (1977a) noted that these tests can perform well even for large, sparse tables, as
long as $q_1 - q_0$ is small relative to the sample size and no expected frequency has larger
order of magnitude than the others.

16.3.5  Asymptotic  Noncentral  Distributions 

Results  in  this  chapter  assume  that  a  certain  parametric  model  holds.  In  practice,  any 
unsaturated  model  almost  surely  does  not  hold  perfectly.  This  is  not  problematic  if  we 
regard models merely as convenient approximations for reality. For instance, the ML
estimator $\hat{\theta}$ converges to a value $\theta_0$ that describes the best fit of the chosen model to reality.
In this sense, inferences for $\theta$ give us information about a useful approximation for reality.

For goodness-of-fit statistics, a relevant distinction exists between limiting behavior
when the model holds and when it does not hold. When the model holds, $X^2$ and $G^2$
have a limiting chi-squared distribution, and the difference between them disappears as
$n$ increases. When the model does not hold, $X^2$ and $G^2$ tend to grow unboundedly as $n$
increases, and $|X^2 - G^2|$ need not go to zero. One method for obtaining proper limiting
distributions considers a sequence of situations $\pi_n$ for which the lack of fit diminishes as $n$
increases. Specifically, the model is $\pi = f(\theta)$, but in reality

$$\pi_n = f(\theta) + \delta/\sqrt{n}. \tag{16.19}$$

The best fit of the model to the population has $i$th probability equal to $f_i(\theta)$, but the true
value differs from that by $\delta_i/\sqrt{n}$.

For this representation, Mitra (1958) showed that $X^2$ has a limiting noncentral chi-squared
distribution, with df $= N - q - 1$ and noncentrality parameter

$$\lambda = n \sum_{i=1}^{N} \frac{[\pi_{ni} - f_i(\theta)]^2}{f_i(\theta)}.$$

This has the form of $X^2$, with the sample values $p_i$ and $\hat{\pi}_i$ replaced by population values
$\pi_{ni}$ and $f_i(\theta)$. Similarly, the noncentrality of the likelihood-ratio statistic has the form of
$G^2$, with the same substitution. Haberman (1974a, pp. 109-112) showed that under certain
conditions $G^2$ and $X^2$ have the same limiting distribution; that is, their noncentrality values
converge to a common value as $n \to \infty$.

Representation (16.19) means that, for large $n$, the noncentral chi-squared approximation
is valid when the model is just barely incorrect. In practice, it is often reasonable to adopt
(16.19) for fixed $n$ to approximate the distribution of $X^2$, even though (16.19) would not be
plausible as we obtain more data. The alternative representation

$$\pi = f(\theta) + \delta \tag{16.20}$$

in which $\pi$ differs from $f(\theta)$ by a fixed amount as $n \to \infty$ may seem more natural. In
fact, this is more appropriate than (16.19) for proving the test to be consistent (i.e., for
convergence to 1 of the probability of rejecting the hypothesis that the model holds). For
(16.20), however, the noncentrality parameter $\lambda$ grows unboundedly as $n \to \infty$, and a
proper limiting distribution does not result for $X^2$ and $G^2$.
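
The contrast between representations (16.19) and (16.20) shows up directly in the noncentrality parameter. In the hedged sketch below, the best-fit probabilities `f`, the perturbation vectors, and the function name `noncentrality` are all invented for illustration. Under the local alternative (16.19), $\lambda = \sum_i \delta_i^2 / f_i(\theta)$ is free of $n$, whereas under the fixed alternative (16.20) it grows linearly in $n$.

```python
def noncentrality(n, pi_true, f):
    """lambda = n * sum_i (pi_i - f_i)^2 / f_i, the X^2 form with population values."""
    return n * sum((p - fi) ** 2 / fi for p, fi in zip(pi_true, f))

f = [0.25, 0.25, 0.25, 0.25]            # hypothetical best-fit model probabilities
delta = [0.4, -0.4, 0.2, -0.2]          # hypothetical departure; sums to zero
fixed = [0.04, -0.04, 0.02, -0.02]      # hypothetical fixed departure for (16.20)

local_lams, fixed_lams = [], []
for n in (100, 400, 1600):
    pi_local = [fi + d / n ** 0.5 for fi, d in zip(f, delta)]    # representation (16.19)
    pi_fixed = [fi + d for fi, d in zip(f, fixed)]               # representation (16.20)
    local_lams.append(noncentrality(n, pi_local, f))
    fixed_lams.append(noncentrality(n, pi_fixed, f))

print(local_lams)   # the same value for every n
print(fixed_lams)   # proportional to n
```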

When the model holds, $\delta = 0$ in either representation (16.19) or (16.20). That is, $f(\theta) = \pi(\theta)$,
$\lambda = 0$, and the results in Sections 16.3.3 and 16.3.4 apply.


16.4  ASYMPTOTIC  DISTRIBUTIONS  FOR  LOGIT/LOGLINEAR  MODELS 

For loglinear models, formulas in Section 9.6 for the asymptotic covariance matrices of $\hat{\theta}$
and $\hat{\pi}$ are special cases of ones derived in Section 16.2. We present these for the multinomial
form of the models, which relates directly to that section. Then we discuss the connection
to Poisson loglinear models.

To constrain probabilities to sum to 1, we express loglinear models for multinomial
sampling as

$$\pi = \exp(X\theta)/[1^T \exp(X\theta)], \tag{16.21}$$

where $X$ is a model matrix and $1^T = (1, \ldots, 1)$. Letting $x_i$ denote row $i$ of $X$,

$$\pi_i = \pi_i(\theta) = \frac{\exp(x_i \theta)}{\sum_k \exp(x_k \theta)}.$$

16.4.1  Asymptotic  Covariance  Matrices 

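A model affects covariance matrices through the Jacobian. Since

$$\frac{\partial\pi_i}{\partial\theta_j} = \frac{[\sum_k \exp(x_k\theta)][\exp(x_i\theta)] x_{ij} - [\exp(x_i\theta)][\sum_k x_{kj} \exp(x_k\theta)]}{[\sum_k \exp(x_k\theta)]^2} = \pi_i x_{ij} - \pi_i \sum_k x_{kj} \pi_k,$$

the matrix of these elements has the form

$$\partial\pi/\partial\theta = [\mathrm{Diag}(\pi) - \pi\pi^T] X.$$

Using this with (16.10) and (16.11), the information matrix at $\theta_0$ is

$$A^T A = (\partial\pi/\partial\theta_0)^T \mathrm{Diag}(\pi_0)^{-1} (\partial\pi/\partial\theta_0)$$

$$= X^T [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T]^T \mathrm{Diag}(\pi_0)^{-1} [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] X$$

$$= X^T [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] X.$$

Thus, for multinomial loglinear models, $\hat{\theta}$ is asymptotically normally distributed with
estimated covariance matrix

$$\widehat{\mathrm{cov}}(\hat{\theta}) = \{ X^T [\mathrm{Diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}^T] X \}^{-1} / n. \tag{16.22}$$

Similarly, from (16.13) the estimated asymptotic covariance matrix of $\hat{\pi}$ is

$$\widehat{\mathrm{cov}}(\hat{\pi}) = \frac{1}{n} [\mathrm{Diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}^T] X \{ X^T [\mathrm{Diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}^T] X \}^{-1} X^T [\mathrm{Diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}^T].$$

From (16.17), the Pearson residuals $e$ are asymptotically normal with

$$\text{asymp. cov}(e) = I - \pi_0^{1/2} (\pi_0^{1/2})^T - A (A^T A)^{-1} A^T$$

$$= I - \pi_0^{1/2} (\pi_0^{1/2})^T - \mathrm{Diag}(\pi_0)^{-1/2} [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] X \{ X^T [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] X \}^{-1} X^T [\mathrm{Diag}(\pi_0) - \pi_0\pi_0^T] \mathrm{Diag}(\pi_0)^{-1/2}.$$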
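
The closed form $\partial\pi/\partial\theta = [\mathrm{Diag}(\pi) - \pi\pi^T] X$ can be checked against finite differences. The sketch below is an illustration under stated assumptions: the $4 \times 2$ model matrix `X` (main effects for a 2 x 2 table, with no column for the constant term), the value of `theta`, and the function names are hypothetical choices made here.

```python
import math

def probs(X, theta):
    """Multinomial loglinear probabilities pi = exp(X theta) / [1^T exp(X theta)]."""
    weights = [math.exp(sum(x * t for x, t in zip(row, theta))) for row in X]
    total = sum(weights)
    return [w / total for w in weights]

def jacobian(X, theta):
    """[Diag(pi) - pi pi^T] X; element (i, j) is pi_i * (x_ij - sum_k x_kj pi_k)."""
    pi = probs(X, theta)
    q = len(theta)
    xbar = [sum(pi[k] * X[k][j] for k in range(len(X))) for j in range(q)]
    return [[pi[i] * (X[i][j] - xbar[j]) for j in range(q)] for i in range(len(X))]

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # hypothetical model matrix
theta = [0.5, -0.3]                    # hypothetical parameter value
J = jacobian(X, theta)

# Compare with central finite differences of pi(theta)
h = 1e-6
max_err = 0.0
for j in range(len(theta)):
    up = list(theta); up[j] += h
    down = list(theta); down[j] -= h
    p_up, p_down = probs(X, up), probs(X, down)
    for i in range(len(X)):
        fd = (p_up[i] - p_down[i]) / (2 * h)
        max_err = max(max_err, abs(fd - J[i][j]))

col_sums = [sum(J[i][j] for i in range(len(X))) for j in range(len(theta))]
print(max_err, col_sums)   # columns sum to zero because the pi_i sum to 1
```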


16.4.2  Connection  with  Poisson  Loglinear  Models 

This book expressed loglinear models in terms of Poisson expected cell frequencies $\mu = (\mu_1, \ldots, \mu_N)^T$,
using formulas of the form

$$\log \mu = X_a \theta_a. \tag{16.23}$$

The model matrix $X_a$ and parameter vector $\theta_a$ in this formula are slightly different from
$X$ and $\theta$ in multinomial model (16.21). The Poisson expression (16.23) does not have
constraints on $\mu$. For multinomial model (16.21), $\sum_i \mu_i = n$ is fixed, and $\pi = \mu/n$ satisfies

$$\log \mu = \log n\pi = X\theta + [\log n - \log(1^T \exp(X\theta))] 1 = X\theta + \lambda 1,$$

where $\lambda = \log n - \log(1^T \exp(X\theta))$. In other words, multinomial model (16.21) implies
Poisson model (16.23) with

$$X_a = [1 : X] \quad \text{and} \quad \theta_a = (\lambda, \theta^T)^T.$$

The columns of $X$ in the multinomial representation must be linearly independent of $1$;
that is, the parameter $\lambda$, which relates to the total sample size, does not appear in $\theta$. The
dimension of $\theta$ is 1 less than the number of parameters reported in this text for Poisson
loglinear models. For instance, for the saturated model, $\theta$ has $N - 1$ elements for the
multinomial representation, reflecting the sole constraint on $\pi$ of $\sum_i \pi_i = 1$.

16.5  SMALL-SAMPLE  SIGNIFICANCE  TESTS  FOR
CONTINGENCY  TABLES

With modern computational power, it is not necessary to rely on large-sample approximations
when $n$ is small or when there is a large number of parameters. For many cases, tests
and confidence intervals can directly use small-sample distributions rather than normal and
chi-squared approximations. We studied small-sample methods in Sections 3.5 and 7.3,
such as Fisher's exact test for testing independence. We next address this more generally
for inference in contingency tables.¹


16.5.1  Exact  Conditional  Distribution  for  I  x  J  Tables 
Under  Independence 

We first derive the distribution used in exact conditional tests of independence for $I \times J$
tables. We assume independent multinomial sampling within rows, as often applies in
comparing $I$ treatment groups. Then row totals $\{n_{i+}\}$ are fixed, and we estimate the $I$
conditional distributions $\{\pi_{j|i}, \; j = 1, \ldots, J\}$. Under $H_0$: independence (i.e., homogeneity),
$\pi_{j|1} = \pi_{j|2} = \cdots = \pi_{j|I} = \pi_{+j}$, for $j = 1, \ldots, J$. The product of the $I$ multinomial
probability functions then simplifies to

$$\frac{\left( \prod_i n_{i+}! \right) \left( \prod_j \pi_{+j}^{n_{+j}} \right)}{\prod_i \prod_j n_{ij}!}. \tag{16.24}$$

This distribution for $\{n_{ij}\}$ depends on $\{\pi_{+j}\}$. These are nuisance parameters, since they do not
describe the association. Fisher proposed eliminating nuisance parameters by conditioning
on their sufficient statistics. From the definition of sufficiency, the resulting conditional
distribution does not depend on those parameters.

The contribution of $\{\pi_{+j}\}$ to the product multinomial distribution (16.24) depends on the
data only through $\{n_{+j}\}$, which are their sufficient statistics. The $\{n_{+j}\}$ have the multinomial
$(n, \{\pi_{+j}\})$ distribution, namely,

$$\frac{n!}{\prod_j n_{+j}!} \prod_j \pi_{+j}^{n_{+j}}. \tag{16.25}$$

The joint probability function of $\{n_{ij}\}$ and $\{n_{+j}\}$ is identical to the probability function of
$\{n_{ij}\}$, since $\{n_{ij}\}$ determines $\{n_{+j}\}$. Thus, the probability function of $\{n_{ij}\}$, conditional on
$\{n_{+j}\}$, equals the probability function (16.24) of $\{n_{ij}\}$ divided by the probability function
(16.25) evaluated at $\{n_{+j}\}$. This gives cell probabilities

$$P(\{n_{ij}\} \mid \{n_{i+}\}, \{n_{+j}\}) = \frac{\left( \prod_i n_{i+}! \right) \left( \prod_j n_{+j}! \right)}{n! \prod_i \prod_j n_{ij}!}. \tag{16.26}$$

¹ Most analyses in Sections 16.5 and 16.6 can be implemented with StatXact (Cytel Software) and/or R and SAS
routines described at the text website.

This multivariate hypergeometric distribution applies to the set of $\{n_{ij}\}$ having the
same $\{n_{i+}\}$ and $\{n_{+j}\}$ as the observed table. For $2 \times 2$ tables, it is the hypergeometric
distribution (3.17). When a table has a single multinomial sample, the unknown parameters
are $\{\pi_{ij}\}$. For testing independence ($\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$), distribution (16.26) results
from conditioning on the row and column totals. These are sufficient statistics for $\{\pi_{i+}\}$
and $\{\pi_{+j}\}$, which determine the null distribution. For either sampling model, both sets
of margins are fixed after the conditioning. The end result (16.26) does not depend on
unknown parameters and thus permits exact probability calculations.
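
As a quick check of the multivariate hypergeometric formula, the following Python sketch evaluates the probability of a table given its margins using exact rational arithmetic. The function name `table_probability` and the example table are assumptions made for illustration. For a $2 \times 2$ table the result should agree with the ordinary hypergeometric probability.

```python
from fractions import Fraction
from math import comb, factorial

def table_probability(table):
    """Probability of the table given its row and column totals:
    (prod_i n_i+!)(prod_j n_+j!) / (n! prod_i prod_j n_ij!)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    numerator = 1
    for total in rows + cols:
        numerator *= factorial(total)
    denominator = factorial(sum(rows))
    for row in table:
        for cell in row:
            denominator *= factorial(cell)
    return Fraction(numerator, denominator)

example = [[3, 1], [1, 3]]                 # hypothetical 2 x 2 table
p = table_probability(example)

# Ordinary hypergeometric probability for the same table
p_hyper = Fraction(comb(4, 3) * comb(4, 1), comb(8, 4))
print(p, p_hyper)
```

Using `Fraction` avoids rounding error, which matters when many small probabilities are later summed into a P-value.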


16.5.2  Exact  Tests  of  Independence  for  I  x  J  Tables 

Exact tests of independence for $I \times J$ tables utilize the multivariate hypergeometric
distribution. Freeman and Halton (1951) defined the P-value as the probability of the set of
tables with the given margins that are no more likely to occur than the table observed.
Other exact tests order the tables using a statistic describing distance from $H_0$. Yates (1934)
used $X^2$. The P-value is then the null value of $P(X^2 \geq X_o^2)$ for observed value $X_o^2$, that
is, the sum of the multivariate hypergeometric probabilities (16.26) for all tables with the
given margins that have $X^2$ at least as large as observed. When classifications have ordered
categories, an ordinal statistic is more relevant. For the one-sided alternative hypothesis of
a positive association, we could use $P(T \geq t_o)$, where $T$ is the correlation or gamma and $t_o$
is its observed value.

Algorithms and software for exact tests for $I \times J$ tables are widely available (e.g.,
Mehta and Patel 1983). We recommend these tests when asymptotic approximations may
be invalid. Computing time increases exponentially as $n$, $I$, or $J$ increase. However, we
can use Monte Carlo to sample randomly from the set of tables with the given margins
(Agresti et al. 1979). The estimated P-value is then the sample proportion of tables having
test statistic value at least as large as the value observed.

As $I$ and/or $J$ increase, the number of possible values for any test statistic $T$ tends to
increase. Thus, the conservativeness issue for conditional tests discussed in Section 3.5.5
becomes less problematic.
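
For very small tables, the exact P-value can be obtained by brute-force enumeration of all tables with the observed margins. The sketch below is only that, a naive enumeration written for illustration; it is not the network algorithm of Mehta and Patel, and the function names, the tie tolerance `tol`, and the example table are assumptions introduced here. It supports both the $X^2$ ordering and the Freeman-Halton ordering by probability.

```python
from fractions import Fraction
from math import factorial

def margins(table):
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]

def probability(table):
    """Multivariate hypergeometric probability of a table, given its margins."""
    rows, cols = margins(table)
    num = 1
    for total in rows + cols:
        num *= factorial(total)
    den = factorial(sum(rows))
    for row in table:
        for cell in row:
            den *= factorial(cell)
    return Fraction(num, den)

def pearson_x2(table):
    """Pearson statistic for independence; assumes all margins are positive."""
    rows, cols = margins(table)
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

def compositions(total, caps):
    """Yield lists v with sum(v) == total and 0 <= v[j] <= caps[j]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield [total]
        return
    for v in range(min(total, caps[0]) + 1):
        for rest in compositions(total - v, caps[1:]):
            yield [v] + rest

def tables_with_margins(rows, cols):
    """Yield every table of nonnegative counts with the given margins."""
    if len(rows) == 1:
        yield [list(cols)]
        return
    for first in compositions(rows[0], cols):
        remaining = [c - f for c, f in zip(cols, first)]
        for rest in tables_with_margins(rows[1:], remaining):
            yield [first] + rest

def exact_p_value(observed, statistic, tol=1e-9):
    """Sum null probabilities of tables whose statistic is at least the observed value."""
    rows, cols = margins(observed)
    cutoff = statistic(observed) - tol      # tolerance guards against floating-point ties
    return sum((probability(t) for t in tables_with_margins(rows, cols)
                if statistic(t) >= cutoff), Fraction(0))

observed = [[3, 1], [1, 3]]                                   # hypothetical table
p_x2 = exact_p_value(observed, pearson_x2)                    # ordering by X^2
p_fh = exact_p_value(observed, lambda t: -probability(t))     # Freeman-Halton ordering
print(p_x2, p_fh)
```

Enumeration of this kind becomes infeasible quickly as $n$, $I$, or $J$ grow, which is why specialized algorithms or Monte Carlo sampling are used in practice.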


16.5.3  Example:  Sexual  Orientation  and  Party  ID 

We illustrate an exact test with Table 16.1, which cross-classifies sexual orientation with
political party ID for subjects of age 18-35 in the 2010 GSS. The first two rows contain
many small counts, and large-sample tests of independence may be inappropriate. With
small counts, the chi-squared approximation tends to be better with $X^2$ than $G^2$. The value
$X^2 = 19.18$ (df $= 12$) has P-value of 0.084.

Table 16.1  Sexual Orientation by Political Party ID

                                      Political Party ID
Sexual         Strong           Indep.               Indep.                 Strong
Orientation    Dem.     Dem.    near Dem.   Indep.   near Repub.   Repub.   Repub.
Homosexual        1       3         3          0          0           1        0
Bisexual          4       2         2          7          0           0        0
Heterosexual     59     109        78        105         55          75       29

Source: 2010 General Social Survey.

Conditional on both sets of margins, using $X^2$ as the test criterion, the null probability
of the observed table and the more extreme tables [based on formula (16.26)] equals 0.080.
So, in this case the large-sample test performed fine.

Alternatively, treating rows and columns as ordinal, we could use an ordinal statistic to
order the tables, potentially giving greater power and permitting one-sided tests. Using the
correlation with row and column numbers as the scores, the sample correlation is 0.095. The
exact P-value is 0.030 for the two-sided alternative and 0.014 for the negative association
alternative, suggesting that heterosexuals are more likely to be Republican. The evidence
is stronger than using $X^2$, which ignores the ordering of categories.


16.6  SMALL-SAMPLE  CONFIDENCE  INTERVALS  FOR 
CATEGORICAL  DATA 

We next consider small-sample interval estimation. For a given test about a particular
parameter $\theta$, a $100(1 - \alpha)\%$ test-based confidence interval (CI) for $\theta$ consists of all $\theta_0$ for
which P-values exceed $\alpha$ in the test of $H_0$: $\theta = \theta_0$.

16.6.1  Small-Sample  CIs  for  a  Binomial  Parameter 

We first consider a binomial parameter $\pi$. In Section 1.4.4 we tested $H_0$: $\pi = \pi_0$ directly
using the binomial distribution. The best known small-sample interval, proposed by Clopper
and Pearson (1934), uses the tail method for forming confidence intervals. It consists of all
$\pi_0$ values for which each one-sided exact binomial P-value exceeds $\alpha/2$. With binomial
outcome $Y = y$ in $n$ trials, the lower and upper endpoints are the solutions in $\pi_0$ to the
equations

$$\sum_{k=y}^{n} \binom{n}{k} \pi_0^k (1 - \pi_0)^{n-k} = \alpha/2 \quad \text{and} \quad \sum_{k=0}^{y} \binom{n}{k} \pi_0^k (1 - \pi_0)^{n-k} = \alpha/2,$$

except that the lower bound is 0 when $y = 0$ and the upper bound is 1 when $y = n$. When
$y = 1, 2, \ldots, n - 1$, from connections between binomial sums and the incomplete beta
function and related cdf's of beta and $F$ distributions, the confidence interval is

$$\left[ 1 + \frac{n - y + 1}{y \, F_{2y, \, 2(n-y+1)}(1 - \alpha/2)} \right]^{-1} < \pi < \left[ 1 + \frac{n - y}{(y + 1) \, F_{2(y+1), \, 2(n-y)}(\alpha/2)} \right]^{-1},$$

where $F_{a,b}(c)$ denotes the $1 - c$ quantile from the $F$ distribution with df$_1 = a$ and df$_2 = b$.
This interval corresponds to inverting a binomial two-sided test for which the P-value is
double the minimum of the one-sided P-values.
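
The endpoints can also be found by solving the two tail equations numerically rather than through $F$ quantiles. The following sketch does this by bisection using only the Python standard library; the function names and the tolerance are assumptions made for illustration, and in practice one would normally rely on established statistical software.

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(y, n, p):
    """P(Y >= y) for Y ~ binomial(n, p)."""
    return sum(pmf(k, n, p) for k in range(y, n + 1))

def lower_tail(y, n, p):
    """P(Y <= y) for Y ~ binomial(n, p)."""
    return sum(pmf(k, n, p) for k in range(0, y + 1))

def bisect(f, lo, hi, tol=1e-12):
    """Root of a monotone function f that changes sign on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(y, n, alpha=0.05):
    """Tail-method interval: each one-sided exact P-value equals alpha/2 at an endpoint."""
    lower = 0.0 if y == 0 else bisect(lambda p: upper_tail(y, n, p) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if y == n else bisect(lambda p: lower_tail(y, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

print(clopper_pearson(0, 25))    # upper endpoint solves (1 - pi)^25 = 0.025
print(clopper_pearson(5, 25))
```

When $y = 0$ the upper endpoint has the closed form $1 - (\alpha/2)^{1/n}$, which provides a convenient check on the numerical solution.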

In principle this approach seems ideal. However, there is a complication: Because of
discreteness, the actual probability that the confidence interval contains the value of $\pi$ is
$\geq (1 - \alpha)$ rather than exactly $(1 - \alpha)$ (Neyman 1935, Casella and Berger 2001, p. 434).
Similarly, for a test of $H_0$: $\pi = \pi_0$ at a fixed desired size $\alpha$ such as 0.05, it is not usually
possible to achieve that size. With a finite number of possible samples, there is a finite
number of possible P-values, of which 0.05 may not be one. In testing $H_0$ with fixed $\pi_0$,
we can pick a particular $\alpha$ that can occur as a P-value. For interval estimation, however,
this is not an option. This is because constructing the interval corresponds to inverting
an entire range of $\pi_0$ values in $H_0$: $\pi = \pi_0$, and each distinct $\pi_0$ value can have its
own set of possible P-values; that is, there is not a single null parameter value $\pi_0$ as in
one test.

Figure 16.1  Plot of coverage probabilities for nominal 95% confidence intervals for binomial parameter $\pi$
when $n = 25$.

The actual coverage probability can be much larger than the nominal confidence level.
When $n = 25$, Figure 16.1 plots the coverage probabilities as a function of the true parameter
value $\pi$, for the Clopper-Pearson method, the large-sample score method, and the Wald
method. At a fixed $\pi$ value with a given method, the coverage probability is the sum of the
binomial probabilities of all those samples for which the resulting interval contains that $\pi$.
With $n = 25$, there are 26 possible samples and 26 corresponding confidence intervals, so
the coverage probability is a sum of somewhere between 0 and 26 binomial probabilities.
As $\pi$ moves from 0 to 1, this coverage probability jumps up or down whenever $\pi$ moves
into or out of one of these 26 intervals. Figure 16.1 shows that coverage probabilities are
too low for the Wald method, whereas the Clopper-Pearson method errs in the opposite
direction. The score method behaves well, its coverage probabilities tending to be near the
nominal level, except for some $\pi$ values close to 0 or 1. This is a good method even with
relatively small $n$, unless $\pi$ is near 0 or 1 (see Exercise 16.32).
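
The coverage calculation described above is a finite sum and is simple to program. The sketch below, with function names chosen here for illustration, computes the actual coverage probability of the Wald and score intervals at a given $\pi$ when $n = 25$. It is an assumption of the sketch that the score interval is computed from the usual closed-form (Wilson) expression.

```python
from math import comb, sqrt
from statistics import NormalDist

def wald_interval(y, n, z):
    p = y / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def score_interval(y, n, z):
    """Closed-form interval from inverting the large-sample score test."""
    p = y / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def coverage(interval, n, pi, alpha=0.05):
    """Sum of binomial probabilities of the samples whose interval contains pi."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for y in range(n + 1):
        lo, hi = interval(y, n, z)
        if lo <= pi <= hi:
            total += comb(n, y) * pi**y * (1 - pi) ** (n - y)
    return total

for pi in (0.05, 0.2, 0.5):
    print(pi, round(coverage(wald_interval, 25, pi), 4),
          round(coverage(score_interval, 25, pi), 4))
```

Evaluating `coverage` over a fine grid of $\pi$ values would trace out the jagged curves of the kind plotted in Figure 16.1.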

In  discrete  problems  using  small-sample  distributions,  shorter  confidence  intervals  result 
from  inverting  a  single  two-sided  test  rather  than  two  one-sided  tests  as  the  Clopper-Pearson 
method  does.  For  the  binomial  parameter,  see  Sterne  (1954),  Blyth  and  Still  (1983),  and 
Blaker  (2000)  for  methods,  summarized  by  Agresti  and  Min  (2001)  and  Fay  (2010a,b). 

With the Sterne approach, for observed outcome $y_o$, the test of $H_0$: $\pi = \pi_0$ has
P-value that sums $P_{\pi_0}(y)$ over all outcomes y with $P_{\pi_0}(y) \le P_{\pi_0}(y_o)$. This leads to
optimality in terms of minimizing total length. In letters to the editor of Applied Statistics,
Mantel and Halperin criticized this method in 1981 (pp. 73-74), but Barnard criticized
their criticism in 1982 (pp. 304-305). With Blaker’s approach the P-value is the minimum
one-tail probability plus an attainable probability in the other tail that is as close
as possible to, but not greater than, that one-tailed probability. This can be expressed as
$P[Q_{\pi_0}(Y) \le Q_{\pi_0}(y_o)]$ with $Q_{\pi_0}(y) = \min[P_{\pi_0}(Y \ge y), P_{\pi_0}(Y \le y)]$. Its P-value cannot be
greater than the Clopper-Pearson P-value. Thus, the corresponding interval is contained
in the Clopper-Pearson interval and is preferable to it. The Blaker and Sterne intervals
both have the nestedness property that an interval with larger confidence level necessarily
contains one with a smaller level. However, they have inconsistencies, such as for certain
data configurations having P-value that increases when an observation is added to the data
set, regardless of its value (Fay 2010a, Vos and Hudson 2008).
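
The three two-sided P-values described above can be compared numerically. The following is a minimal Python sketch, not taken from the text: the function names and the relative tolerance used to handle ties are illustrative assumptions, and the Clopper-Pearson P-value is taken to be twice the smaller one-sided tail probability, capped at 1.

```python
from math import comb

def binom_pmf(y, n, pi):
    """Binomial probability P(Y = y) for n trials with success probability pi."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def clopper_pearson_p(y_obs, n, pi0):
    """Twice the smaller one-sided tail probability, capped at 1."""
    pmf = [binom_pmf(y, n, pi0) for y in range(n + 1)]
    lower = sum(pmf[:y_obs + 1])   # P(Y <= y_obs)
    upper = sum(pmf[y_obs:])       # P(Y >= y_obs)
    return min(1.0, 2 * min(lower, upper))

def sterne_p(y_obs, n, pi0, rel_tol=1e-9):
    """Sum of the probabilities that are no larger than that of the observed outcome."""
    pmf = [binom_pmf(y, n, pi0) for y in range(n + 1)]
    cutoff = pmf[y_obs] * (1 + rel_tol)   # tolerance guards against rounding in ties
    return min(1.0, sum(p for p in pmf if p <= cutoff))

def blaker_p(y_obs, n, pi0, rel_tol=1e-9):
    """P[Q(Y) <= Q(y_obs)], where Q(y) = min[P(Y >= y), P(Y <= y)]."""
    pmf = [binom_pmf(y, n, pi0) for y in range(n + 1)]
    q = [min(sum(pmf[y:]), sum(pmf[:y + 1])) for y in range(n + 1)]
    cutoff = q[y_obs] * (1 + rel_tol)
    return min(1.0, sum(p for p, qy in zip(pmf, q) if qy <= cutoff))

if __name__ == "__main__":
    # Illustrative values only: y = 0 successes in n = 25 trials.
    for pi0 in (0.05, 0.10, 0.137):
        print(pi0, clopper_pearson_p(0, 25, pi0), sterne_p(0, 25, pi0), blaker_p(0, 25, pi0))
```

A confidence interval then consists of the $\pi_0$ values whose P-value exceeds the chosen error level, which could be located by a grid or root search over $\pi_0$.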


16.6.2 CIs Based on Tests Using the Mid P-Value

In Section 1.4.4 we adjusted for discreteness in small-sample distributions by basing inference
on the mid P-value. For a statistic T with observed result $t_o$ for which larger
results more strongly contradict $H_0$, this is $\frac{1}{2}P(T = t_o) + P(T > t_o)$, less than the ordinary
P-value of $P(T \ge t_o)$. Less conservative confidence intervals invert tests using the exact
distribution with a mid P-value.

As with the ordinary P-value, there are various ways we could construct the interval.
We’ll illustrate for the binomial parameter. First, we could mimic the Clopper-Pearson
construction (16.6.1) but replace each tail sum by the corresponding one-sided mid-P tail
sum. This corresponds to inverting a test such that the 95% confidence interval is the set of
$\pi$ values for which double the minimum of the one-sided mid P-values exceeds 0.05, and it
performs well while tending to be slightly conservative (Agresti and Gottard 2007). Brown
et al. (2001) showed that this interval is similar to the Bayesian posterior interval generated
with the Jeffreys prior distribution (beta with parameters 0.5 and 0.5). That interval has
actual coverage probability close to the nominal level.

Another possible approach mimics the Blaker approach. It inverts the test for which the
P-value is the minimum one-sided mid P-value plus the mid-P probability in the other tail
that is as close as possible to that but no greater than it. An approach that mimics the Sterne
interval inverts the test for which the P-value is the sum of $P_{\pi_0}(y)$ for all outcomes y with
$P_{\pi_0}(y) < P_{\pi_0}(y_o)$ added to half the probability of y such that $P_{\pi_0}(y) = P_{\pi_0}(y_o)$. Yet another
approach would use the ordinary mid P-value with a statistic T, such as $T = z^2$ for the
score statistic (1.11).


16.6.3  Example:  Proportion  of  Vegetarians  Revisited 

In Section 1.4.3 we estimated the proportion $\pi$ of vegetarians in a population for which a
sample  of  size  n  =  25  had  y  =  0  vegetarians.  The  95%  confidence  intervals  from  inverting 
large-sample  tests  were  (0,  0)  for  the  Wald  test,  (0,  0.074)  for  the  likelihood-ratio  test,  and 
(0,  0.133)  for  the  score  test. 

For comparison, the Clopper-Pearson 95% interval for $\pi$ is (0.0, 0.137). This means that
if we tested $H_0$: $\pi = 0.137$ against $H_a$: $\pi < 0.137$ and observed y = 0 in n = 25 trials, the
binomial P-value = $P(Y \le 0) = (1 - 0.137)^{25} = 0.025$. The 95% interval using the Blaker
method is (0.0, 0.128). The 95% interval using the mid-P adaptation of the Clopper-Pearson
method is (0.0, 0.113), which means that $\frac{1}{2}P(Y \le 0) = \frac{1}{2}(1 - 0.113)^{25} = 0.025$.
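
Because y = 0, each of these upper limits solves a single equation in $\pi$ and has a closed form. The short Python sketch below (function names are illustrative, not from the text) solves $(1 - \pi)^n = \alpha/2$ for the Clopper-Pearson limit and $\frac{1}{2}(1 - \pi)^n = \alpha/2$ for the mid-P limit.

```python
def cp_upper_when_zero(n, alpha=0.05):
    """Clopper-Pearson upper limit when y = 0: solves (1 - pi)^n = alpha / 2."""
    return 1 - (alpha / 2) ** (1 / n)

def midp_upper_when_zero(n, alpha=0.05):
    """Mid-P upper limit when y = 0: solves 0.5 * (1 - pi)^n = alpha / 2."""
    return 1 - alpha ** (1 / n)

if __name__ == "__main__":
    n = 25
    print(round(cp_upper_when_zero(n), 3))    # Clopper-Pearson upper limit
    print(round(midp_upper_when_zero(n), 3))  # mid-P upper limit
```

These shortcuts apply only to the y = 0 case, where the lower limit is 0; for other outcomes both endpoints require a numerical search.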

16.6.4 Small-Sample CIs for Odds Ratios

To construct a small-sample confidence interval for the odds ratio, we can use a nonnull
exact conditional distribution. For multinomial sampling, the distribution of $\{n_{ij}\}$ depends
on n and cell probabilities $\{\pi_{ij}\}$. For 2×2 tables, the odds ratio is

$$\theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \frac{\pi_{11}(1 - \pi_{1+} - \pi_{+1} + \pi_{11})}{(\pi_{1+} - \pi_{11})(\pi_{+1} - \pi_{11})}.$$

Hence, $\pi_{11}$ is a function of $\theta$ and $\{\pi_{1+}, \pi_{+1}\}$. The same argument applies to any $\pi_{ij}$,
so the multinomial distribution of $\{n_{ij}\}$ can use parameters $\{\theta, \pi_{1+}, \pi_{+1}\}$. Conditional on
$\{n_{1+}, n_{+1}\}$, the distribution of $\{n_{ij}\}$ depends only on $\theta$. Since $n_{11}$ determines all other
cell counts, given the marginal totals, the conditional distribution of $\{n_{ij}\}$ is specified
by some function $P(n_{11} = t) = f(t; n_{1+}, n_{+1}, n, \theta)$. This distribution is the noncentral
hypergeometric introduced in (7.9),

$$f(t; n_{1+}, n_{+1}, n, \theta) = \frac{\binom{n_{1+}}{t}\binom{n - n_{1+}}{n_{+1} - t}\theta^{t}}{\sum_u \binom{n_{1+}}{u}\binom{n - n_{1+}}{n_{+1} - u}\theta^{u}}. \qquad (16.27)$$


The conditional ML estimate of $\theta$ is the value of $\theta$ that maximizes probability (16.27).
Differentiating the log-likelihood with respect to $\theta$ shows that this estimate satisfies the
equation $n_{11} = E(n_{11})$ in $\theta$, where the expectation refers to distribution (16.27). This
equation has a unique solution $\hat\theta$ and is solved using iterative methods (Cornfield 1956).
This estimator differs from the unconditional ML estimator, the sample odds ratio
$n_{11}n_{22}/n_{12}n_{21}$, which uses the ML estimates of $\{\pi_{ij}\}$ for the multinomial distribution of $\{n_{ij}\}$.
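
The estimating equation can be derived in two lines. The sketch below assumes the noncentral hypergeometric form in which $P(n_{11} = t)$ is proportional to $c_t\theta^t$, where $c_u$ is shorthand introduced here for the product of binomial coefficients.

```latex
\begin{aligned}
L(\theta) &= \log c_t + t \log\theta - \log \sum_u c_u \theta^{u},
\qquad c_u = \binom{n_{1+}}{u}\binom{n - n_{1+}}{n_{+1} - u}, \\
\frac{\partial L(\theta)}{\partial \theta}
 &= \frac{t}{\theta} - \frac{\sum_u u\, c_u \theta^{u-1}}{\sum_u c_u \theta^{u}}
  = \frac{1}{\theta}\left[\, t - E_\theta(n_{11}) \right].
\end{aligned}
```

Setting the derivative to zero at the observed $t = n_{11}$ gives $n_{11} = E_\theta(n_{11})$. Since $E_\theta(n_{11})$ increases in $\theta$, a simple bracketing search is one way to find the solution when $n_{11}$ is strictly between its smallest and largest possible values.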

A confidence interval for $\theta$ results from inverting the test of $H_0$: $\theta = \theta_0$, having observed
$n_{11} = t_o$. For $H_a$: $\theta > \theta_0$ and for $H_a$: $\theta < \theta_0$, the P-values are

$$P = \sum_{t \ge t_o} f(t; n_{1+}, n_{+1}, n, \theta_0) \quad \text{and} \quad P = \sum_{t \le t_o} f(t; n_{1+}, n_{+1}, n, \theta_0).$$

When $\theta_0 = 1$, these are one-sided Fisher’s exact tests. Mimicking the Clopper-Pearson
approach, Cornfield (1956) set the lower endpoint as the $\theta_0$ for which $P = \alpha/2$ in testing
against $H_a$: $\theta > \theta_0$ and the upper endpoint as the $\theta_0$ for which $P = \alpha/2$ for $H_a$: $\theta < \theta_0$. The
interval is the set of $\theta_0$ for which both one-sided P-values $\ge \alpha/2$.

As in Fisher’s exact test, the conditional approach to interval estimation is necessarily
conservative because of discreteness. The actual confidence coefficient, defined as the
infimum of the coverage probabilities for all possible $\theta$, has the nominal confidence level
as a lower bound. Less conservative behavior and shorter intervals result from inverting a
single two-sided test rather than inverting two one-sided tests. For the test criterion we could
use the chi-squared score statistic for testing $H_0$: $\theta = \theta_0$, but utilize the exact conditional
distribution to obtain the P-value (Agresti and Min 2001). Or, we could invert the test that
has P-value equal to the minimum one-tail probability plus an attainable probability in the
other tail that is as close as possible to, but not greater than, that one-tailed probability
(Blaker 2000, Fay 2010). Or, we could invert the test that has P-value that sums the
probabilities that are no greater than the probability of the observed table (Baptista and
Pike 1977).

Using instead mid P-values to invert hypergeometric tests of $\theta = \theta_0$ yields narrower
intervals  with  coverage  probability  usually  nearer  the  nominal  level,  but  not  having  that 
level  as  a  lower  bound.  For  interval  estimation  of  the  odds  ratio,  this  method  tends  to  be 
a  bit  conservative,  but  for  small  samples  can  yield  much  shorter  intervals  than  Cornfield’s 
exact  interval. 

An alternative approach with independent binomial samples inverts nonnull unconditional
small-sample tests (Agresti and Min 2002, Lin and Yang 2006, Troendle and Frank
2001), an approach discussed in Section 16.6.8. Because of the reduced discreteness, such
intervals are also usually shorter.

16.6.5  Example:  Fisher’s  Tea  Taster  Revisited 

We illustrate with Table 3.9 from Fisher’s tea-tasting experiment, for which we illustrated
Fisher’s exact test in Section 3.5.1. The conditional ML estimate of $\theta$ is 6.41. Software
provides the Cornfield tail-method interval (0.21, 626.24) with confidence coefficient guaranteed
$\ge 0.95$. Not surprisingly, it is very wide because of the small sample. Inverting
the family of two-sided exact conditional score tests gives a more precise interval, (0.31,
306.24). The unconditional approach is not appropriate here because of the sampling design.

Large-sample methods do not have the guarantee of bounds on error probabilities. They
can be conservative or liberal, and thus their results can appear quite different from exact
methods. For example, for the tea-tasting data, the 95% large-sample Wald confidence
interval (3.2) for the odds ratio is (0.37, 220.93), the large-sample score interval (proposed
by Cornfield 1956 and by Miettinen and Nurminen 1985) without a continuity correction is
(0.48, 168.87), and the large-sample profile likelihood interval is (0.48, 418.98). Normally,
we would prefer an exact method over an approximate one. When the conditional distribution
is highly discrete, however, the choice is not so obvious. Exact methods then can be
quite conservative, especially with small samples.

For highly discrete data, it seems sensible to use adjustments of exact methods based on
the mid P-value. For the tea-tasting data, for instance, the 95% confidence interval based
on inverting two one-sided hypergeometric tests using the mid P-value is (0.31, 308.6),
similar to the interval obtained by inverting the small-sample score test.
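
The conditional ML estimate and the two tail-based intervals quoted in this example can be approximated with a few lines of Python. This is a sketch under stated assumptions rather than the software used in the text: the function names are illustrative, the tea-tasting table is taken to have $n_{11} = 3$ with row and column totals of 4 and n = 8, and a log-scale bisection stands in for whatever iterative method the software applies.

```python
from math import comb

def nc_hypergeom_pmf(n1p, np1, n, theta):
    """Noncentral hypergeometric pmf of n11 given both margins, as {t: probability}."""
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    w = {t: comb(n1p, t) * comb(n - n1p, np1 - t) * theta**t for t in range(lo, hi + 1)}
    total = sum(w.values())
    return {t: v / total for t, v in w.items()}

def solve_increasing(g, target, lo=1e-8, hi=1e8, iters=200):
    """Bisection on the log scale for g(theta) = target, assuming g increases in theta."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

def tail(t_obs, margins, theta, upper, mid_p):
    """P(n11 >= t_obs) if upper, else P(n11 <= t_obs); mid_p halves the weight of t_obs."""
    pmf = nc_hypergeom_pmf(*margins, theta)
    beyond = sum(p for t, p in pmf.items() if (t > t_obs if upper else t < t_obs))
    return beyond + (0.5 if mid_p else 1.0) * pmf[t_obs]

def tail_interval(t_obs, margins, alpha=0.05, mid_p=False):
    """Invert two one-sided conditional tests; assumes t_obs is not at either extreme."""
    lower = solve_increasing(lambda th: tail(t_obs, margins, th, True, mid_p), alpha / 2)
    # The lower-tail probability decreases in theta, so solve for its negative.
    upper = solve_increasing(lambda th: -tail(t_obs, margins, th, False, mid_p), -alpha / 2)
    return lower, upper

def conditional_mle(t_obs, margins):
    """Solve t_obs = E(n11; theta), the conditional likelihood equation."""
    mean = lambda th: sum(t * p for t, p in nc_hypergeom_pmf(*margins, th).items())
    return solve_increasing(mean, t_obs)

if __name__ == "__main__":
    margins, t_obs = (4, 4, 8), 3   # assumed layout of the tea-tasting table
    print(conditional_mle(t_obs, margins))
    print(tail_interval(t_obs, margins))              # tail method
    print(tail_interval(t_obs, margins, mid_p=True))  # mid-P adaptation
```

When the observed count sits at the smallest or largest value allowed by the margins, one endpoint is 0 or infinite and should be set directly instead of searched for.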

16.6.6  Small-Sample  CIs  for  Logistic  Regression  Parameters 

In Section 7.3 we used conditional ML to eliminate nuisance parameters in logistic regression
models, by conditioning on their sufficient statistics. We used this approach also
in Section 11.2 for matched-pairs data and in more general contexts for clustered data in
Section 13.1. With the conditional likelihood we can use either large-sample methods or
small-sample exact distributions. With the latter, we can conduct exact inference for logistic
regression parameters, as explained in Section 7.3.2 and illustrated for Fisher’s exact test for
2×2 tables in Section 7.3.3, a test for the effect parameter in a linear logit model in Section
7.3.4, and a test of conditional independence in multiple 2×2 tables in Section 7.3.5. For
that multiple 2×2 table case, we now illustrate small-sample confidence intervals.

For logistic model (7.11) of homogeneous association in 2×2×K tables, the ordinary
ML estimator of the odds ratio $\theta = \exp(\beta)$ behaves poorly for sparse-data asymptotics. The
conditional ML estimator maximizes the conditional likelihood function after reducing the
parameter space by conditioning on sufficient statistics for the other parameters (Andersen
1970, Birch 1964b). For cell counts given $\{n_{1+k}, n_{+1k}\}$ for all k, the conditional probability
mass function for $(n_{111} = t_1, \ldots, n_{11K} = t_K)$ is the product of the hypergeometric
functions (16.27) from the separate strata, or

$$\prod_k P(n_{11k} = t_k \mid n_{1+k}, n_{+1k}, n_{++k}; \theta) = \prod_k \frac{\binom{n_{1+k}}{t_k}\binom{n_{++k} - n_{1+k}}{n_{+1k} - t_k}\theta^{t_k}}{\sum_u \binom{n_{1+k}}{u}\binom{n_{++k} - n_{1+k}}{n_{+1k} - u}\theta^{u}}. \qquad (16.28)$$


The conditional ML estimator $\hat\theta$ maximizes (16.28). Like the Mantel-Haenszel estimator
$\hat\theta_{MH}$, it has good properties for both standard and sparse-data asymptotic cases (Andersen
1970, Breslow 1981), since the number of parameters does not change as K does. It can be
slightly more efficient than $\hat\theta_{MH}$, except when $\theta = 1.0$, where they are equally efficient, or
for matched pairs, where they are identical (Breslow 1981).

The conditional distribution (16.28) induces one for $\sum_k n_{11k}$, which is used to test
$H_0$: $\theta = \theta_0$ for an arbitrary value $\theta_0$. Then, a 95% confidence interval for $\theta$ consists of all
$\theta_0$ for which the P-value exceeds 0.05. Such an interval is guaranteed to have at least the
nominal coverage probability (Gart 1970, Kim and Agresti 1995, Mehta et al. 1985). This
extends the interval for a single 2×2 table presented in Section 16.6.4.
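
One way to obtain the distribution of $\sum_k n_{11k}$ is to convolve the stratum-specific noncentral hypergeometric distributions. The Python sketch below illustrates this; the function names and the three strata in the usage example are hypothetical and are not data from the text.

```python
from math import comb

def stratum_weights(n1p, np1, n, theta):
    """Unnormalized noncentral hypergeometric weights for one 2x2 stratum."""
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    return {t: comb(n1p, t) * comb(n - n1p, np1 - t) * theta**t for t in range(lo, hi + 1)}

def sum_distribution(strata, theta):
    """Conditional pmf of sum_k n11k, by convolving the stratum distributions."""
    dist = {0: 1.0}
    for n1p, np1, n in strata:
        w = stratum_weights(n1p, np1, n, theta)
        total = sum(w.values())
        new = {}
        for s, ps in dist.items():
            for t, wt in w.items():
                new[s + t] = new.get(s + t, 0.0) + ps * wt / total
        dist = new
    return dist

def one_sided_p_values(t_obs, strata, theta0):
    """(lower-tail, upper-tail) exact conditional P-values for H0: theta = theta0."""
    dist = sum_distribution(strata, theta0)
    lower = sum(p for s, p in dist.items() if s <= t_obs)
    upper = sum(p for s, p in dist.items() if s >= t_obs)
    return lower, upper

if __name__ == "__main__":
    # Hypothetical strata, each given as (n_{1+k}, n_{+1k}, n_{++k}).
    strata = [(3, 2, 7), (4, 3, 9), (2, 2, 6)]
    print(one_sided_p_values(4, strata, theta0=1.0))
```

A tail-method interval would collect the $\theta_0$ values for which both one-sided P-values are at least $\alpha/2$, as for a single table.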

Consider the promotion discrimination case in Table 7.5. There, $\sum_k n_{11k} = 0$, so the
lower bound of any confidence interval for $\theta$ should be 0. For the generalization to several
strata of Cornfield’s tail-method interval, StatXact software reports a 95% confidence
interval of (0, 1.01). Using mid-P-values or P-values based on a finer partitioning of the
sample space in tests and related confidence intervals reduces conservativeness (Note 3.10).
Inverting exact tests of $H_0$: $\theta = \theta_0$ with the mid P-value yields the interval (0, 0.78).
However, this approach cannot guarantee that the actual coverage probability is bounded
below by 0.95.

Zelen (1971) presented a small-sample test of homogeneity of the odds ratios. Agresti
(1992) discussed this and other small-sample methods for contingency tables. The methods
of  this  section  extend  to  estimating  conditional  odds  ratios  in  models  with  several  predictors, 
as  the  following  example  illustrates. 


16.6.7  Example:  Diarrhea  and  an  Antibiotic 

Table  16.2  refers  to  2493  patients  having  stays  in  a  hospital.  The  response  is  whether  they 
suffered  an  acute  form  of  diarrhea  during  their  stay.  The  three  predictors  are  age  (1  for  over 
50  years  old,  0  for  under  50),  length  of  stay  in  hospital  (1  for  more  than  1  week,  0  for  less 
than  1  week),  and  exposure  to  an  antibiotic  called  Cephalexin  (1  for  yes,  0  for  no).  We 
discuss  estimation  of  the  effect  of  Cephalexin,  adjusting  for  age  and  length  of  stay,  using  a 
logistic  model  containing  only  main-effect  terms. 

The sample size is large, yet relatively few cases of acute diarrhea occurred. Moreover,
all subjects having exposure to Cephalexin were also diarrhea cases, which causes an ML
estimate of $\infty$ for the Cephalexin log odds ratio effect. To study that effect, we use an
exact distribution, conditioning on sufficient statistics for the other predictors. Constructing
a confidence interval by inverting the conditional test for the parameter, we obtain a 95%
confidence interval of (19, $\infty$) for the odds ratio. Assuming that the main-effects model is
valid, Cephalexin appears to have a strong effect.

Table 16.2  Data for Effect of Cephalexin Antibiotic on Diarrhea

Cephalexin^a   Age^a   Length of Stay^a   Cases of Diarrhea   Sample Size
     0           0            0                   0                385
     0           0            1                   5                233
     0           1            0                   3                789
     0           1            1                  47               1081
     1           1            1                   5                  5

^a See the text for an explanation of 0 and 1.

Source: Based on study by E. Jaffe and V. Chang, Cornell Medical Center, reported
in the Manual for LogXact 7. Cambridge, MA: CYTEL Software, 2005, p. 470.

Results  must  be  qualified  somewhat  because  no  Cephalexin  cases  occurred  at  the  first 
three  combinations  of  levels  of  age  and  length  of  stay.  In  fact,  the  first  three  rows  of 
Table  16.2  make  no  contribution  to  the  analysis  (Exercise  16.4).  The  data  actually  provide 
evidence  about  the  effect  of  Cephalexin  only  for  older  subjects  having  a  long  stay. 
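
As a rough check on the lower limit, the following Python sketch treats the last two rows of Table 16.2 as a single 2×2 table (5 of 5 exposed subjects and 47 of 1081 unexposed subjects were cases) and inverts the one-sided conditional test. This is a simplification made here for illustration: it conditions only on the number of cases within that stratum, which need not match the conditioning used by the software for the main-effects model, and the function names are hypothetical.

```python
from math import comb

def prob_all_exposed_are_cases(theta, exposed=5, unexposed=1081, cases=52):
    """P(n11 = exposed) under the noncentral hypergeometric for the single stratum."""
    w = [comb(exposed, t) * comb(unexposed, cases - t) * theta**t
         for t in range(exposed + 1)]
    return w[-1] / sum(w)

def lower_limit(alpha=0.05, lo=1.0, hi=1000.0, iters=200):
    """Smallest theta0 whose upper-tail P-value P(n11 >= 5) reaches alpha / 2."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if prob_all_exposed_are_cases(mid) < alpha / 2:
            lo = mid   # P-value too small: theta0 lies below the interval
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    # The observed count is at its maximum, so the upper limit is infinite.
    print(lower_limit())
```

Under this simplification the lower limit comes out a little above 19, in line with the reported interval.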

16.6.8  Unconditional  Small-Sample  CIs  for  Difference  of  Proportions 

The  conditional  approach  to  eliminating  nuisance  parameters  works  for  parameters  that 
have  sufficient  statistics.  However,  reduced  sufficient  statistics  occur  only  for  models  that 
use  the  canonical  parameters  for  the  exponential  family  representation  (e.g.,  logit  for 
binomial,  log  mean  for  Poisson).  For  binary  data,  such  models  must  be  in  terms  of  the  log 
odds. For 2×2 tables, the conditional approach can yield confidence intervals for the log
odds  ratio  but  not  for  differences  or  ratios  of  proportions.  An  unconditional  approach  is 
more  complex  but  does  not  require  sufficient  statistics. 

Consider interval estimation of the difference of proportions for independent binomial
samples. We used the unconditional approach in Section 3.5.6 for small-sample testing
of $\pi_1 - \pi_2 = 0$. An unconditional confidence interval inverts the corresponding test of
$H_0$: $\pi_1 - \pi_2 = \delta_0$, for any fixed $-1 < \delta_0 < 1$. The probability function for the table is
the product of bin($n_1$, $\pi_1$) and bin($n_2$, $\pi_2$) mass functions. We can express this in terms
of $\delta = \pi_1 - \pi_2$ and a nuisance parameter $\lambda$. For instance, if $\lambda = \pi_1 + \pi_2$, we substitute
$\pi_1 = (\lambda + \delta)/2$ and $\pi_2 = (\lambda - \delta)/2$. For $\delta = \delta_0$ and a fixed value of $\lambda$, we use this binomial
product to calculate the probability that the test statistic is at least as large as observed. The
P-value is the supremum of such probabilities calculated over all possible values for $\lambda$. This
provides a family of tests for the various values of $\delta_0$. The confidence interval for $\pi_1 - \pi_2$
is the set of $\delta_0$ for which this P-value exceeds $\alpha$. Analogous unconditional intervals apply
for the odds ratio and relative risk.
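
The supremum over the nuisance parameter can be approximated on a grid. The Python sketch below illustrates only the mechanics: it uses the unstandardized statistic $|p_1 - p_2 - \delta_0|$, chosen here for simplicity and not the score statistic behind the intervals quoted in this section, and the grid size and function names are assumptions.

```python
from math import comb

def binom_pmf(y, n, pi):
    """Binomial probability P(Y = y)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def unconditional_p_value(y1, n1, y2, n2, delta0, grid=400):
    """Grid approximation to sup over lambda of P(T >= t_obs), T = |p1 - p2 - delta0|."""
    t_obs = abs(y1 / n1 - y2 / n2 - delta0)
    # Tables at least as extreme as the observed one (small tolerance for rounding).
    extreme = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
               if abs(a / n1 - b / n2 - delta0) >= t_obs - 1e-12]
    # lambda = pi1 + pi2 must keep both probabilities inside [0, 1].
    lam_min, lam_max = abs(delta0), 2 - abs(delta0)
    best = 0.0
    for k in range(grid + 1):
        lam = lam_min + (lam_max - lam_min) * k / grid
        pi1 = min(1.0, max(0.0, (lam + delta0) / 2))
        pi2 = min(1.0, max(0.0, (lam - delta0) / 2))
        pmf1 = [binom_pmf(a, n1, pi1) for a in range(n1 + 1)]
        pmf2 = [binom_pmf(b, n2, pi2) for b in range(n2 + 1)]
        best = max(best, sum(pmf1[a] * pmf2[b] for a, b in extreme))
    return min(1.0, best)

if __name__ == "__main__":
    # Illustrative scan over delta0 for y1 = 5, y2 = 1 with n1 = n2 = 10.
    for d0 in (-0.2, 0.0, 0.2, 0.4, 0.6, 0.8):
        print(d0, unconditional_p_value(5, 10, 1, 10, d0))
```

A confidence interval would be read off as the range of $\delta_0$ values whose P-value exceeds $\alpha$; because a grid can miss the exact supremum, the result is only approximate.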

This approach can also be quite conservative. For details regarding various test statistics,
see Chan and Zhang (1999) and Santner et al. (2007). To reduce the degree of conservatism,
it is better to invert a single two-sided test than to invert two separate one-sided tests (Agresti
and Min 2001). For example, with $n_1 = n_2 = 10$ and binomial outcomes $y_1 = 5$ and $y_2 = 1$,
the 95% confidence interval for $\pi_1 - \pi_2$ is (-0.001, 0.700) in inverting a two-sided score
test and (-0.020, 0.741) in inverting two one-sided tests.

16.7 ALTERNATIVE ESTIMATION THEORY FOR PARAMETRIC MODELS

This  text  has  primarily  used  the  maximum  likelihood  (ML)  approach  to  inference.  This 
is,  by  far,  the  most  common  approach  for  categorical  data  analysis.  We’ve  also  presented 
the  Bayesian  approach.  Other  frequentist  paradigms  have  been  used,  however.  This  section 
discusses some of them. These methods have asymptotic properties similar to those of ML, so the
large-sample theory presented earlier in this chapter applies also to them.

16.7.1  Weighted  Least  Squares  for  Categorical  Data 

Weighted least squares (WLS) is an extension of ordinary least squares that permits responses
to be correlated and to have nonconstant variance. This and related quasi-likelihood
methods introduced in Sections 4.7 and 12.3 are sometimes simpler to apply than ML.
Familiarity with the WLS method is useful because:

1.  WLS  computations  have  a  standard  form  that  is  simple  to  apply  for  a  wide  variety  of 
models. 

2.  Algorithms  for  calculating  ML  estimates  often  consist  of  iterative  use  of  WLS. 

3.  When  the  model  holds,  WLS  and  ML  estimators  are  asymptotically  equivalent,  both 
falling  in  the  class  of  best  asymptotically  normal  (BAN)  estimators. 

By (3), for large samples the WLS and ML estimators are approximately normally distributed
around the parameter value, and the ratio of their variances converges to 1. Grizzle,
Starmer, and Koch (1969) popularized WLS for categorical data analyses. In honor of them,
WLS for such analyses is often called the GSK method.

For a response variable Y with J categories, consider multinomial samples of sizes
$n_1, \ldots, n_I$ at I levels of an explanatory variable or combinations of levels of several
explanatory variables. Let $\pi = (\pi_1^T, \ldots, \pi_I^T)^T$, where

$$\pi_i = (\pi_{1|i}, \pi_{2|i}, \ldots, \pi_{J|i})^T \quad \text{with} \quad \sum_j \pi_{j|i} = 1$$

denotes the conditional distribution of Y at level i. Let p denote corresponding sample
proportions, with V their IJ × IJ covariance matrix. When the I samples are independent,
V is block diagonal, with blocks $V_1, \ldots, V_I$ for the sample proportions $p_i$ at the separate levels.

From Section 16.1.4, the covariance matrix of $\sqrt{n_i}\,p_i$ is

$$n_i V_i = \begin{pmatrix}
\pi_{1|i}(1 - \pi_{1|i}) & -\pi_{1|i}\pi_{2|i} & \cdots & -\pi_{1|i}\pi_{J|i} \\
-\pi_{2|i}\pi_{1|i} & \pi_{2|i}(1 - \pi_{2|i}) & \cdots & -\pi_{2|i}\pi_{J|i} \\
\vdots & \vdots & & \vdots \\
-\pi_{J|i}\pi_{1|i} & -\pi_{J|i}\pi_{2|i} & \cdots & \pi_{J|i}(1 - \pi_{J|i})
\end{pmatrix}.$$

Each set of proportions has (J - 1) linearly independent elements.
Let F be a vector of $u \le I(J - 1)$ response functions

$$F(\pi) = [F_1(\pi), \ldots, F_u(\pi)]^T.$$

The WLS approach applies to linear models for F of form

$$F(\pi) = X\beta, \qquad (16.29)$$

where $\beta$ is a q × 1 vector of parameters and X is a u × q model matrix of known constants
having rank q. From Section 10.5.1, loglinear and logit response functions are special cases
of $F(\pi) = C \log(A\pi)$ for certain matrices C and A.

Let F(p) denote the sample response functions. We assume that F has continuous
second-order partial derivatives in an open region containing $\pi$. This assumption enables the
delta method to determine the large-sample normal distribution for F(p). The asymptotic
covariance matrix of F(p) depends on the u × IJ matrix

$$Q = \left[\frac{\partial F_k(\pi)}{\partial \pi_{j|i}}\right]$$

for k = 1, ..., u and all IJ combinations (i, j). Linear response models have response
functions of form $F(\pi) = A\pi$ for a matrix of known constants A, in which case Q = A.
For the generalized loglinear model $F(\pi) = C \log(A\pi)$ (recall Sections 10.5.1 and 12.1.4),
$Q = C[\mathrm{Diag}(A\pi)]^{-1}A$. By the multivariate delta method (Section 16.1.5), the asymptotic
covariance matrix of F(p) is

$$V_F = Q V Q^T.$$

Let $\hat V_F$ denote the sample version of $V_F$, substituting sample proportions in Q and V. For
subsequent formulas, this matrix must be nonsingular.
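
As a small worked case, consider a binary response (J = 2) at a single level i with the logit as response function. The calculation below is an illustrative sketch, with $\pi_1$ and $\pi_2 = 1 - \pi_1$ written as shorthand for $\pi_{1|i}$ and $\pi_{2|i}$.

```latex
\begin{aligned}
F(\pi) &= \log \pi_1 - \log \pi_2, \qquad
Q = \left( \frac{1}{\pi_1},\; -\frac{1}{\pi_2} \right), \qquad
V_i = \frac{1}{n_i}
\begin{pmatrix} \pi_1\pi_2 & -\pi_1\pi_2 \\ -\pi_1\pi_2 & \pi_1\pi_2 \end{pmatrix}, \\
V_F &= Q V_i Q^T
 = \frac{\pi_1\pi_2}{n_i}\left( \frac{1}{\pi_1} + \frac{1}{\pi_2} \right)^{2}
 = \frac{1}{n_i \pi_1 \pi_2}.
\end{aligned}
```

This is the familiar large-sample variance of a sample logit. With I independent binomial samples, $V_F$ for the vector of logits is diagonal with these entries, and the sample version is singular or undefined whenever a sample proportion equals 0 or 1.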


16.7.2  Inference  Using  the  WLS  Approach  to  Model  Fitting 

For the general model (16.29), the WLS estimate of $\beta$ is

$$b = (X^T \hat V_F^{-1} X)^{-1} X^T \hat V_F^{-1} F(p).$$

This is the $\beta$ value that minimizes the quadratic form

$$[F(p) - X\beta]^T \hat V_F^{-1} [F(p) - X\beta].$$

The ordinary least-squares estimate, for uncorrelated responses with constant variance,
results when $\hat V_F$ is a constant multiple of the identity matrix. The WLS estimator has an
asymptotic multivariate normal distribution, with estimated covariance matrix

$$\widehat{\mathrm{cov}}(b) = (X^T \hat V_F^{-1} X)^{-1}.$$

The normal approximation improves as the sample size increases and F(p) is more nearly
normally distributed.

The estimate b yields predicted values $\hat F = Xb$ for the response functions. When the
model holds, $\hat F$ is asymptotically better than F(p) as an estimator of $F(\pi)$ (Section 16.2.2).
The estimated covariance matrix of the predicted values is

$$\hat V_{\hat F} = X (X^T \hat V_F^{-1} X)^{-1} X^T.$$

The test of model goodness of fit uses the residual term

$$W = [F(p) - Xb]^T \hat V_F^{-1} [F(p) - Xb] = F(p)^T \hat V_F^{-1} F(p) - b^T (X^T \hat V_F^{-1} X) b,$$

which compares the sample response functions with their model predicted values. Under $H_0$:
$F(\pi) - X\beta = 0$ that the model holds, W is asymptotically chi-squared with df = u - q, the
difference between the number of response functions and the number of model parameters.
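
For independent binomial samples with logit response functions, $\hat V_F$ is diagonal and the WLS formulas reduce to a weighted regression of sample logits. The Python sketch below fits a two-parameter linear logit model this way; the function name and the grouped counts in the usage example are hypothetical, and every sample proportion must lie strictly between 0 and 1.

```python
from math import log

def wls_logit_fit(x, y, n):
    """WLS fit of logit(pi_i) = b0 + b1 * x_i to grouped binomial counts y_i out of n_i."""
    p = [yi / ni for yi, ni in zip(y, n)]
    F = [log(pi / (1 - pi)) for pi in p]               # sample logits F(p)
    w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]   # inverses of delta-method variances
    # Entries of X^T V^-1 X and X^T V^-1 F for the model matrix with columns (1, x).
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * Fi for wi, Fi in zip(w, F))
    t1 = sum(wi * xi * Fi for wi, xi, Fi in zip(w, x, F))
    det = s0 * s2 - s1 * s1
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    cov = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]   # (X^T V^-1 X)^-1
    resid = [Fi - (b0 + b1 * xi) for Fi, xi in zip(F, x)]
    W = sum(wi * ri * ri for wi, ri in zip(w, resid))      # residual chi-squared statistic
    return {"b": (b0, b1), "cov": cov, "W": W, "df": len(x) - 2,
            "weights": w, "logits": F}

if __name__ == "__main__":
    # Hypothetical grouped data: five settings of x with 50 observations each.
    fit = wls_logit_fit([0, 1, 2, 3, 4], [6, 10, 18, 27, 36], [50] * 5)
    print(fit["b"], fit["W"], fit["df"])
```

With correlated response functions, $\hat V_F$ is not diagonal and the same formulas require general matrix inversion.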

We can more closely check the model fit by studying the residuals, $F(p) - \hat F$. They are
orthogonal to the fit $\hat F$, so

$$\mathrm{cov}[F(p)] = \mathrm{cov}\{[F(p) - \hat F] + \hat F\} = \mathrm{cov}[F(p) - \hat F] + \mathrm{cov}(\hat F).$$

Thus, the estimated covariance matrix of the residuals equals

$$\widehat{\mathrm{cov}}[F(p)] - \widehat{\mathrm{cov}}(\hat F) = \hat V_F - \hat V_{\hat F} = \hat V_F - X (X^T \hat V_F^{-1} X)^{-1} X^T.$$

Dividing the residuals by their standard errors yields standardized residuals having
large-sample standard normal distributions.

Hypotheses about contrasts and other effects of explanatory variables have form
$H_0$: $C\beta = 0$, where C is a known c × q matrix with $c \le q$, having rank c. The estimator
Cb of $C\beta$ is asymptotically normal with mean 0 under $H_0$ and with covariance
matrix estimated by $C (X^T \hat V_F^{-1} X)^{-1} C^T$. The Wald statistic

$$W_C = b^T C^T [C (X^T \hat V_F^{-1} X)^{-1} C^T]^{-1} C b \qquad (16.30)$$

has an approximate chi-squared null distribution with df = c. This statistic also equals the
difference between residual chi-squared statistics for the reduced model implied by $H_0$ and
the full model. For the special case $H_0$: $\beta_i = 0$, $W_C = b_i^2/\widehat{\mathrm{var}}(b_i)$ has df = 1.

16.7.3  Scope  of  WLS  Versus  ML  Estimation 

The WLS approach requires estimating the multinomial covariance matrix of sample responses
at each setting of the explanatory variables. It is inapplicable when explanatory
variables are continuous, since there may be only one observation at each such setting. WLS
also becomes less appropriate as the number of explanatory variables increases, since few
observations may occur at each of the many combinations of settings. By contrast, in principle,
continuous explanatory variables or many explanatory settings are not problematic
to ML.

When a certain model holds, with large cell expected frequencies ML and WLS give
similar results. Both estimators are in the class of best asymptotically normal estimators.
However, practical considerations often favor ML estimation. For example, zero cell counts
often adversely affect the WLS approach. The sample response functions may then be
ill-defined or have a singular estimated covariance matrix.

WLS  shares  with  quasi-likelihood  the  feature  that  inferential  results  depend  only  on 
specifying  a  model  for  the  mean  responses  and  specifying  a  variance  function  and  covariance 
structure  (here,  based  on  the  multinomial).  It  does  not  use  the  likelihood  function  for  the 
complete  distribution.  Thus,  inference  uses  Wald  methods. 

Historically,  an  advantage  of  the  WLS  approach  was  computational  simplicity.  This  is 
not  relevant  now  that  software  is  available  for  ML  analyses  and  for  extensions  of  WLS 
(e.g.,  quasi-likelihood  methods  such  as  GEE)  that  do  not  have  some  of  its  disadvantages. 
Nonetheless, WLS has close connections with more sophisticated methods. Many algorithms
for calculating ML estimates (such as the Fisher scoring method of Section 4.6.4 for
GLMs) and quasi-likelihood estimates (such as the GEE method) iteratively use WLS.


16.7.4  Minimum  Chi-Squared  Estimators 

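Consider estimation of $\pi$ or $\theta$, assuming a model $\pi = \pi(\theta)$. Let $\tilde\theta$ denote a generic estimator
of $\theta$, for which $\tilde\pi = \pi(\tilde\theta)$ estimates $\pi$. The ML estimator $\hat\theta$ maximizes the likelihood. It
also minimizes the deviance statistic $G^2$ comparing observed and fitted proportions (Section
16.3.4). Other estimators minimize other measures of distance between $\pi(\theta)$ and p.

The value $\tilde\theta$ that minimizes the Pearson statistic

$$X^2[\pi(\theta), p] = n \sum_i \frac{[p_i - \pi_i(\theta)]^2}{\pi_i(\theta)}$$

is called the minimum chi-squared estimate. It is simpler to calculate the estimate that
minimizes the modified chi-squared statistic

$$X^2_{mod}[\pi(\theta), p] = n \sum_i \frac{[p_i - \pi_i(\theta)]^2}{p_i} \qquad (16.31)$$

that replaces the denominator by the sample proportion. This minimum modified chi-squared
estimate is the solution for $\theta$ to the equations

$$\sum_i \frac{\pi_i(\theta)}{p_i}\,\frac{\partial \pi_i(\theta)}{\partial \theta_j} = 0 \quad \text{for each component } \theta_j \text{ of } \theta.$$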

Neyman  (1949)  introduced  minimum  modified  chi-squared  estimators.  He  showed  that 
they  and  minimum  chi-squared  estimators  are  best  asymptotically  normal  (BAN)  estimators. 
When  the  model  holds,  they  are  asymptotically  equivalent  to  ML  estimators.  Under  the 
model,  different  estimation  methods  yield  nearly  identical  estimates  of  parameters  when 
n  is  large.  When  the  model  does  not  hold,  estimates  for  different  methods  can  be  quite 
different,  even  when  n  is  large.  The  estimators  converge  to  values  for  which  the  model 
gives  the  best  approximation  to  reality,  and  this  approximation  is  different  when  best  is 
defined in terms of minimizing $G^2$ rather than minimizing $X^2$ or some other measure.
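
The near agreement of the estimators when the model fits well can be seen in a one-parameter example. The Python sketch below uses a hypothetical trinomial model $\pi(\theta) = (\theta^2, 2\theta(1 - \theta), (1 - \theta)^2)$ with made-up counts, and a golden-section search that assumes each criterion has a single minimum on (0, 1); none of these choices comes from the text.

```python
from math import log

def model(theta):
    """Hypothetical one-parameter trinomial model."""
    return [theta**2, 2 * theta * (1 - theta), (1 - theta)**2]

def g2(theta, p, n):
    """Deviance statistic comparing sample and model proportions."""
    return 2 * n * sum(pi * log(pi / mi) for pi, mi in zip(p, model(theta)))

def x2(theta, p, n):
    """Pearson statistic, with model proportions in the denominator."""
    return n * sum((pi - mi)**2 / mi for pi, mi in zip(p, model(theta)))

def x2_mod(theta, p, n):
    """Modified chi-squared statistic, with sample proportions in the denominator."""
    return n * sum((pi - mi)**2 / pi for pi, mi in zip(p, model(theta)))

def golden_section_min(f, lo=0.001, hi=0.999, iters=200):
    """Minimize f on [lo, hi], assuming a single interior minimum."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

def estimates(counts):
    """ML, minimum chi-squared, and minimum modified chi-squared estimates of theta."""
    n = sum(counts)
    p = [c / n for c in counts]
    return tuple(golden_section_min(lambda th: crit(th, p, n)) for crit in (g2, x2, x2_mod))

if __name__ == "__main__":
    print(estimates([30, 50, 20]))   # hypothetical counts
```

For this model the ML estimate also has the closed form $(2n_1 + n_2)/2n$, which gives a check on the numerical search.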

For any n, minimum modified chi-squared estimates are sometimes identical to WLS
estimates. The connection refers to an alternative way of specifying a model, using a set of
constraint equations for $\pi$,

$$\{g_j(\pi_1, \ldots, \pi_N) = 0\}.$$

For instance, for an I × J table, the (I - 1)(J - 1) constraint equations

$$\log \pi_{ij} - \log \pi_{i,j+1} - \log \pi_{i+1,j} + \log \pi_{i+1,j+1} = 0$$

specify the model of independence. The number of constraint equations equals the residual
df for the model.

Neyman (1949) noted that minimum modified chi-squared estimates result from minimizing

$$\sum_{i=1}^{N} \frac{(p_i - \pi_i)^2}{p_i} + \sum_{j=1}^{N-t} \lambda_j g_j(\pi_1, \ldots, \pi_N)$$

with respect to $\pi$, where the $\{\lambda_j\}$ are Lagrange multipliers. When the constraint equations
are linear in $\pi$, the resulting estimating equations are linear. Then Bhapkar (1966) showed
that these estimators are identical to WLS estimators, and (16.31) equals the WLS residual
statistic (Section 16.7.2) for testing model fit. Usually, however, constraint equations are
nonlinear in $\pi$, such as for the independence model. The WLS estimator is then the
minimum modified chi-squared estimator based on a linearized version of the constraints,

$$g_j(p) + (\pi - p)^T \partial g_j(\pi)/\partial \pi = 0,$$

with differential vector evaluated at p.

Berkson  (1944,  1955,  1980)  was  a  strong  advocate  of  minimum  chi-squared  methods. 
For  logistic  regression,  his  minimum  logit  chi-squared  estimators  minimized  a  weighted 
sum  of  squares  between  sample  logits  and  linear  predictions.  Mantel  (1985)  criticized 
such  methods,  noting  that  their  consistency  requires  group  sizes  to  grow  large,  whereas 
ML  (or  conditional  ML,  when  there  are  many  nuisance  parameters)  is  consistent  however 
information  goes  to  the  limit  (see  also  Exercise  16.38). 

16.7.5  Minimum  Discrimination  Information 

Kullback  (1959)  formulated  estimation  by  minimum  discrimination  information  (MDI). 
The  discrimination  information  for  two  probability  vectors  it  and  y  is 

N 

I(tt;y)  =  '^27Ti\og(7Ti/yi).  (16.32) 

1  =  1 

This  directed  Kullback-Leibler  distance  measure  between  it  and  y  is  nonnegative,  equaling 
0  only  when  it  —  y.  Gokhale  and  Rollback  (1978)  studied  MDI  estimates  that  minimize 
I(it;  y),  subject  to  model  constraints,  using  y  =  p  for  some  problems  and  y  with  y\  =  yi  = 
•  •  •  =  Yn  =  l/N  for  others.  Good  (1963)  conducted  related  work  in  the  area  of  maximum 
entropy. 

In some cases with $\{\gamma_i = 1/N\}$, the MDI estimator is identical to the ML estimator
(Simon 1973). With $\gamma = p$ it is not ML, but it has similar asymptotic properties, being
BAN. Then Gokhale and Kullback recommended testing goodness of fit using twice the
minimized value of $I(\pi; p)$. This statistic reverses the roles of $p$ and $\pi$ relative to $G^2$, much
as $X^2_{\mathrm{mod}}$ in (16.32) reverses their roles relative to $X^2$. Both statistics fall in the class of power
divergence statistics (Cressie and Read 1984, Exercise 1.34) and have similar asymptotic
properties. More generally, we could choose any member of the power divergence statistics
and define estimates to be the values minimizing it. Under regularity conditions, they are
all BAN.
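
The statistics named in this section can be compared numerically. The following is a minimal Python sketch, not code from the text: the function names and the example counts are illustrative assumptions, and the power divergence family is written in its usual Cressie-Read form with index `lam`. It assumes strictly positive observed and fitted counts having the same total.

```python
import math

def discrimination_information(pi, gamma):
    """I(pi; gamma) = sum_i pi_i log(pi_i / gamma_i), taking 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(pi, gamma) if a > 0)

def power_divergence(counts, fitted, lam):
    """Cressie-Read statistic comparing observed counts with fitted counts.

    lam = 1 gives Pearson X^2, the limit lam -> 0 gives G^2, the limit
    lam -> -1 gives twice the (scaled) discrimination information with the
    roles of observed and fitted reversed, and lam = -2 gives the modified
    (Neyman) chi-squared statistic.
    """
    if lam == 0:
        return 2 * sum(n * math.log(n / m) for n, m in zip(counts, fitted))
    if lam == -1:
        return 2 * sum(m * math.log(m / n) for n, m in zip(counts, fitted))
    total = sum(n * ((n / m) ** lam - 1) for n, m in zip(counts, fitted))
    return 2 * total / (lam * (lam + 1))

counts = [30, 50, 20]          # hypothetical observed counts
fitted = [25.0, 50.0, 25.0]    # hypothetical fitted counts with the same total
for lam in (1, 0, -1, -2):
    print(lam, round(power_divergence(counts, fitted, lam), 4))
```

When the fit is reasonable, the four printed values are close to one another, which is consistent with the members of the family sharing the same asymptotic behavior.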


NOTES 

Section  16.1:  Delta  Method 

16.1 Delta method generalized: For details of large-sample theory for categorical data, including
the delta method, see Bishop et al. (1975, Chap. 14). In applying the delta method to a
function $g$ of an asymptotically normal random vector $T_n$, suppose that the first-order,
..., $(a - 1)$st-order differentials of the function are zero at $\theta$, but the $a$th-order differential is
nonzero. A generalization of the delta method implies that $n^{a/2}[g(T_n) - g(\theta)]$ has limiting
distribution involving products of order $a$ of components of a normal random vector. When
$a = 2$, the limiting distribution is a quadratic form in a multivariate normal vector, which
often relates to a chi-squared variable; in the univariate case, it is $\sigma^2[g''(\theta)]/2$ times a $\chi^2_1$
variable (Casella and Berger 2001, p. 244).

16.2 Higher-order asymptotics: Higher-order asymptotic methods such as saddlepoint
approximations improve on first-order normal approximations. When there are many nuisance
parameters, modified profile likelihood functions are useful. See Brazzale et al. (2007),
Brazzale and Davison (2008), Davison et al. (2006), Pierce and Peters (1992), and
Strawderman and Wells (1998).

16.3  Bootstrap/jackknife:  Resampling  methods  such  as  the  jackknife  and  the  bootstrap  are 
alternative  tools  for  estimating  standard  errors  and  obtaining  confidence  intervals.  They  can 
be  helpful  when  use  of  the  delta  method  is  questionable — for  instance,  for  small  samples, 
highly  sparse  data,  or  complex  sampling  designs.  For  details,  see  Davison  and  Hinkley 
(1997)  and  Fay  (1985). 
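
As an illustration of the resampling idea in this note, the sketch below compares a nonparametric bootstrap standard error with the delta-method standard error for the sample log odds ratio of a 2 x 2 table. It is only a sketch under stated assumptions: the table counts, the function names, the number of replicates, and the seed are all hypothetical, and a single multinomial sample over the four cells is assumed.

```python
import math
import random

def log_odds_ratio(n11, n12, n21, n22):
    """Sample log odds ratio of a 2 x 2 table with positive counts."""
    return math.log(n11 * n22 / (n12 * n21))

def delta_se(table):
    """Delta-method SE of the log odds ratio: sqrt of the sum of 1/n_ij."""
    return math.sqrt(sum(1 / n for n in table))

def bootstrap_se(table, reps=2000, seed=1):
    """Bootstrap SE: resample n observations from the observed cell proportions."""
    rng = random.Random(seed)
    n = sum(table)
    cells = range(4)
    estimates = []
    for _ in range(reps):
        draw = rng.choices(cells, weights=table, k=n)
        resampled = [draw.count(c) for c in cells]
        if min(resampled) == 0:
            continue  # log odds ratio undefined; skipped in this sketch
        estimates.append(log_odds_ratio(*resampled))
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    return math.sqrt(var)

table = [40, 60, 25, 75]  # hypothetical counts (n11, n12, n21, n22)
print(round(delta_se(table), 3), round(bootstrap_se(table), 3))
```

With counts of this size the two standard errors should be similar; the bootstrap becomes more informative for sparse tables, where skipping resamples with a zero cell would itself need more careful handling.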


Section  16.3:  Asymptotic  Distributions  of  Residuals  and  Goodness-of-Fit  Statistics 

16.4  Asymptotic  theory  and  sparseness:  Andersen  (1980),  Bishop  et  al.  (1975),  Cox  (1984), 
Haberman  (1974a),  and  Watson  (1959)  provided  other  proofs  or  considered  related  cases 
to  the  Pearson-Fisher-Cramer-Rao-Birch  results.  Haberman  (1988)  showed  that  large- 
sample  results  for  X2  break  down  with  nonstandard  asymptotics,  such  as  when  the  number 
of cells $N$ grows as $n \to \infty$ or when different expected frequencies grow at different
rates. 

16.5 Freeman-Tukey gof: If $Y$ is Poisson, then for large $\mu$ the delta method implies $\sqrt{Y}$
is approximately normal with standard deviation $\tfrac{1}{2}$. This motivates the Freeman-Tukey
goodness-of-fit statistic, $FT = 4 \sum_i (\sqrt{y_i} - \sqrt{\hat\mu_i})^2$. When the model holds, $FT - X^2$ is also
$o_p(1)$ as $n \to \infty$ (Bishop et al. 1975, p. 514).
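
The variance-stabilizing claim behind this statistic can be checked directly. The sketch below, whose function names and example values are illustrative assumptions rather than anything from the text, computes the exact standard deviation of the square root of a Poisson variable by summing its probability mass function, and evaluates the Freeman-Tukey statistic for given observed and fitted counts. It assumes a moderate mean, so that exp(-mu) does not underflow.

```python
import math

def sd_sqrt_poisson(mu):
    """Standard deviation of sqrt(Y) for Y ~ Poisson(mu), by summing the pmf."""
    upper = int(mu + 12 * math.sqrt(mu) + 20)  # truncation point far in the tail
    prob = math.exp(-mu)                       # P(Y = 0); contributes 0 to both sums
    mean_sqrt = 0.0                            # accumulates E[sqrt(Y)]
    mean_y = 0.0                               # accumulates E[Y]
    for k in range(1, upper + 1):
        prob *= mu / k                         # P(Y = k) from P(Y = k - 1)
        mean_sqrt += math.sqrt(k) * prob
        mean_y += k * prob
    return math.sqrt(mean_y - mean_sqrt ** 2)

def freeman_tukey(counts, fitted):
    """FT = 4 * sum (sqrt(y_i) - sqrt(fitted_i))^2."""
    return 4 * sum((math.sqrt(y) - math.sqrt(m)) ** 2 for y, m in zip(counts, fitted))

for mu in (5, 20, 50):
    print(mu, round(sd_sqrt_poisson(mu), 4))
print(round(freeman_tukey([30, 50, 20], [25, 50, 25]), 4))
```

The printed standard deviations approach 1/2 as the mean grows, which is the approximation the note relies on.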

16.6  Noncentrality:  Drost  et  al.  (1989)  gave  noncentral  approximations  using  other  sequences 
of  alternatives  than  the  local  and  fixed  ones  (16.20)  and  (16.21). 


Section 16.5: Small-Sample Significance Tests for Contingency Tables

16.7 Exact tests: For exact treatment of $I \times J$ tables, see Mehta and Patel (1983). For ordered
categories, see Agresti et al. (1990). For Monte Carlo estimation of exact P-values, see
Agresti et al. (1979), Booth and Butler (1999), Diaconis and Sturmfels (1998), Forster et al.
(1996), and Patefield (1982). Gail and Mantel (1977) and Good (1976) gave approximate
formulas for the number of tables having certain fixed margins. Freidlin and Gastwirth
(1999) extended the unconditional approach to tests for trend in $I \times 2$ tables and conditional
independence in several $2 \times 2$ tables.

16.8 Ancillarity: Suppose that $(\theta, \lambda)$ has minimal sufficient statistic $(T, U)$, where $\lambda$ is a nuisance
parameter. Cox and Hinkley (1974, p. 35) defined $U$ to be ancillary for $\theta$ if its distribution
depends only on $\lambda$, and the distribution of $T$ given $U$ depends only on $\theta$. For $2 \times 2$ tables with
odds ratio $\theta$ and $\lambda = (\pi_{1+}, \pi_{+1})$, let $T = n_{11}$ and $U = (n_{1+}, n_{+1})$. Then $U$ is not ancillary,
because its distribution depends on $\theta$ as well as $\lambda$. Using a definition due to Godambe,
Bhapkar (1989) referred to the marginals $U$ as partial ancillary for $\theta$. This means that the
distribution of the data, given $U$, depends only on $\theta$, and that for fixed $\theta$, the family of
distributions of $U$ for various $\lambda$ is complete. Liang (1984) gave an alternative definition
referring to conditional and unconditional inference being equally efficient.

16.9  Randomized  tests,  fuzzy  inference:  For  discrete  data,  it  is  possible  to  achieve  exactly 
a  desired  size  by  using  a  randomized  decision  on  the  boundary  of  the  critical  region.  To 
construct  a  confidence  interval  that  achieves  exactly  (a  priori)  a  desired  coverage  probability, 
we  can  invert  such  randomized  tests  (Stevens  1950).  In  practice,  this  approach  is  not  used 
because  of  the  undesirability  of  inferential  conclusions  being  determined  by  a  random 
number,  but  see  Agresti  and  Gottard  (2007)  for  details.  Geyer  and  Meeden  (2005)  proposed 
a  related  fuzzy  inference  approach  consisting  of  a  graphical  portrayal  of  all  such  possible 
randomized  confidence  intervals. 


Section  16.7:  Alternative  Estimation  Theory  for  Multinomial  Models 

16.10 WLS, MDI: For details about the WLS approach, see Imrey (2011), Imrey et al. (1981), and
Koch et al. (1985). For discussion of minimum chi-squared methods, see Neyman (1949),
Rao (1963), Bhapkar (1966), and Koch et al. (1985). For more about minimum discrimination
information, see Ireland and Kullback (1968a,b), Ireland et al. (1969), Ku et al. (1971), and
Gokhale and Kullback (1978).


EXERCISES 

Applications 

16.1  An  advertisement  by  Schering  Corp.  for  the  allergy  drug  Claritin  mentioned  that 
in  a  pediatric  randomized  clinical  trial,  symptoms  of  nervousness  were  shown  by  4 
of  188  patients  on  loratadine  (Claritin),  2  of  262  patients  taking  placebo,  and  2  of 
170  patients  on  chloropheniramine.  Conduct  an  analysis  of  whether  nervousness 
depends  on  drug. 

16.2  Consider  a  3  x  3  table  having  entries,  by  row,  of  (4,  2,  0  /  2,  2,  2  /  0,  2,  4).  Conduct 
an  exact  test  of  independence,  using  X2.  Assuming  ordered  rows  and  columns  and 
using  equally  spaced  scores,  conduct  an  ordinal  exact  test.  Explain  why  results 
differ  so  much. 

16.3 Consider exact tests of independence, given the marginals, for the $I \times I$ table
having $n_{ii} = 1$ for $i = 1, \ldots, I$ and $n_{ij} = 0$ otherwise. Show that (a) tests that
order tables by their probabilities, $X^2$, or $G^2$ have P-value = 1.0, and (b) the
one-sided test that orders tables by an ordinal statistic such as $r$ or $C - D$ has
P-value = $1/I!$.

16.4  For  Table  16.2,  apply  conditional  logistic  regression  to  the  model  discussed  in 
Section  16.6.7. 

a. Obtain an exact P-value for testing no C effect against the alternative of a positive
effect. Construct a 95% confidence interval for the conditional CD odds ratio.

b. Construct the partial tables relating C to D for the combinations of levels of
(A, L). For the sole partial table having data at both C levels, find a 95%
exact confidence interval for the odds ratio and find an exact one-sided P-value.
Compare  to  results  using  the  entire  data  set.  Explain  why  there  is  no  contribution 
to  inference  for  tables  having  only  a  single  positive  row  total  or  a  single  positive 
column  total. 

c.  Obtain  the  ordinary  ML  fit  of  the  logistic  regression  model.  To  investigate  the 
sensitivity  of  the  estimated  C  effect,  find  the  change  in  the  estimate  and  SE 
after  adding  one  observation  to  the  data  set,  a  case  with  no  diarrhea  when 
(C,  A,  L)  =  (1,  1,  1). 

Theory  and  Methods 

16.5  Explain  why: 

a. If $c \neq 0$, $cz_n$ has the same order as $z_n$; that is, $o(cz_n)$ is equivalent to $o(z_n)$ and
$O(cz_n)$ is equivalent to $O(z_n)$.

b. $o(y_n)o(z_n) = o(y_n z_n)$, $O(y_n)O(z_n) = O(y_n z_n)$, $o(y_n)O(z_n) = o(y_n z_n)$.

16.6 If $X^2$ has an asymptotic chi-squared distribution with fixed df as $n \to \infty$, then
explain why $X^2/n = o_p(1)$.

16.7 a. Use Tchebychev’s inequality to show that if $E(X_n) = \mu_n$ and $\mathrm{var}(X_n) = \sigma_n^2 < \infty$,
then $(X_n - \mu_n) = O_p(\sigma_n)$.

b. Suppose that $Y_1, \ldots, Y_n$ are iid with $E(Y_i) = \mu$ and $\mathrm{var}(Y_i) = \sigma^2$. For $\bar{Y}_n =
(\sum_i Y_i)/n$, apply (a) to show that $\bar{Y}_n - \mu = O_p(n^{-1/2})$.

16.8 Let $Y$ be a Poisson random variable with mean $\mu$.

a. For a constant $c > 0$, show that

$$E[\log(Y + c)] = \log\mu + (c - \tfrac{1}{2})/\mu + O(\mu^{-2}).$$

[Hint: Note that $\log(Y + c) = \log\mu + \log[1 + (Y + c - \mu)/\mu]$.]

b. For independent Poisson cell counts in a $2 \times 2$ table, use (a) to argue that the
sample log odds ratio after adding $\tfrac{1}{2}$ to each cell is a sensible estimator for
reducing bias in estimating the log odds ratio.

16.9 Let $p = y/n$ for a binomial variate $y$. Find the asymptotic distribution of the
estimator $[p(1 - p)]^{1/2}$ of the standard deviation. What happens when $\pi = 0.5$?

16.10 Suppose $T_n$ has a Poisson distribution with mean $\lambda = n\mu$, for fixed $\mu > 0$. For large
$n$, show that $\log T_n$ is approximately normal with mean $\log(\lambda)$ and variance $\lambda^{-1}$.
[Hint: By the central limit theorem, $T_n/n$ is approximately $N(\mu, \mu/n)$ for large $n$.]


16.11  Refer  to  the  previous  exercise. 

a. If $T_n$ is Poisson, show $\sqrt{T_n}$ has asymptotic variance $\tfrac{1}{4}$.

b. For a binomial sample proportion $p$, show the asymptotic variance of $\sin^{-1}(\sqrt{p})$
(with the angle being measured in radians) is $1/(4n)$. [This transformation and the
one in (a) are variance stabilizing, producing variates with asymptotic variances
that are the same for all values of the parameter. Traditionally, these
transformations were employed to make ordinary least squares applicable to
count data. See Cochran (1940) for discussion and ML analyses. Rucker et al.
(2008) showed that, in reality, this arc sine transformation does not have variance
nearly constant when $\pi$ is near 0 or near 1 and $n$ is not large.]

16.12 For a multinomial $(n, \{\pi_i\})$ distribution, show the correlation between $p_i$ and $p_j$ is
$-[\pi_i\pi_j/(1 - \pi_i)(1 - \pi_j)]^{1/2}$. What does this equal when $\pi_i = 1 - \pi_j$ and $\pi_k = 0$
for $k \neq i, j$?


16.13 An animal population has $N$ species, with population proportion $\pi_i$ of species
$i$. Simpson’s index of ecological diversity (Simpson 1949) is $I(\pi) = 1 - \sum_i \pi_i^2$.
[Rao (1982) surveyed diversity measures.]

a. Two animals are randomly chosen from the population, with replacement. Show
that $I(\pi)$ is the probability they are of different species.

b. For proportions $p$ for a random sample, show that the estimated asymptotic
standard error of $I(p)$ is

$$2\Big\{\Big[\sum_i p_i^3 - \Big(\sum_i p_i^2\Big)^2\Big]\Big/n\Big\}^{1/2}.$$


16.14 For independent Poisson random variables $\{Y_i\}$, show that the estimated asymptotic
variance of $\sum_i a_i \log(Y_i)$ is $\sum_i a_i^2/y_i$. [This formula applies to ML estimators
of parameters for the saturated loglinear model, which are contrasts of $\{\log(y_i)\}$.
Formula (16.9) yields the asymptotic covariance structure of such estimators; see
Lee (1977).]


16.15  Assuming  independent  binomial  samples,  derive  the  asymptotic  standard  error  of 
the  log  relative  risk  (Section  3.1.3). 


16.16 The sample size may need to be large for the ordinal measure $\hat\gamma$ to have an
approximate normal distribution when $|\gamma|$ is large. The Fisher-type transform
$\hat\xi = \tfrac{1}{2}\log[(1 + \hat\gamma)/(1 - \hat\gamma)]$ (Agresti 2010, p. 217; O’Gorman and Woolson 1988)
converges more quickly to normality.

a. Show that the asymptotic variance of $\hat\xi$ equals the asymptotic variance of $\hat\gamma$
multiplied by $(1 - \gamma^2)^{-2}$.

b. Explain how to construct a confidence interval for $\xi$ and use it to obtain one for $\gamma$.

c. Show that $\hat\xi = \tfrac{1}{2}\log(C/D)$. For $2 \times 2$ tables, show that this is half the log odds
ratio.


16.17 Let $\phi^2(T) = \sum_i (T_i - \pi_{i0})^2/\pi_{i0}$. Then $\phi^2(p) = X^2/n$, where $X^2$ is the Pearson
statistic (1.16) for testing $H_0$: $\pi_i = \pi_{i0}$, $i = 1, \ldots, N$, and $n\phi^2(\pi)$ is that test’s
noncentrality when $\pi$ is the true value. Under $H_0$, why does the delta method not
yield an asymptotic normal distribution for $\phi^2(p)$? (See Note 16.1.)

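16.18 In an $I \times J$ contingency table, let $\theta_{ij}$ denote local odds ratio (2.10).

a. Show that asymp. $\mathrm{cov}(\sqrt{n}\log\hat\theta_{ij}, \sqrt{n}\log\hat\theta_{i+1,j}) = -[\pi_{i+1,j}^{-1} + \pi_{i+1,j+1}^{-1}]$.

b. Show that asymp. $\mathrm{cov}(\sqrt{n}\log\hat\theta_{ij}, \sqrt{n}\log\hat\theta_{i+1,j+1}) = \pi_{i+1,j+1}^{-1}$.

c. When $\hat\theta_{ij}$ and $\hat\theta_{hk}$ use mutually exclusive sets of cells, show that asymp.
$\mathrm{cov}(\sqrt{n}\log\hat\theta_{ij}, \sqrt{n}\log\hat\theta_{hk}) = 0$.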

16.19 Consider the model for a $2 \times 2$ table: $\pi_{11} = \theta^2$, $\pi_{12} = \pi_{21} = \theta(1 - \theta)$, $\pi_{22} =
(1 - \theta)^2$ (Exercises 3.31 and 11.39).

a. Find $A$ in (16.10) for this model, and use $A$ to obtain the asymptotic variance of
$\hat\theta$. (As a check, it is simple to find it directly using the inverse of $-E\,\partial^2 L/\partial\theta^2$,
where $L$ is the log likelihood.) For which $\theta$ value is the variance maximized?
What is the distribution of $\hat\theta$ if $\theta = 0$ or $\theta = 1$?

b. Find the asymptotic covariance matrix of $\sqrt{n}\,\hat\pi$.

16.20 Refer to the model for the calf data in Section 1.5.6. Obtain the asymptotic variance
of $\hat\pi$.

16.21 Cell counts $\{Y_i\}$ are independent Poisson random variables, with $\mu_i = E(Y_i)$.
Consider the model

$$\log\mu = X_a\theta_a, \quad \text{where } \mu = (\mu_1, \ldots, \mu_N)'.$$

Using arguments similar to those in Section 14.2, show that the large-sample
covariance matrix of $\hat\theta_a$ can be estimated by $[X_a'\,\mathrm{Diag}(\hat\mu)X_a]^{-1}$, where $\hat\mu$ is the
ML estimator of $\mu$.

16.22 Use the delta method, with derivatives (16.17), to derive the asymptotic covariance
matrix in (16.18) for residuals. Show that this matrix is idempotent.

16.23 In some situations, $X^2$ and $G^2$ take similar values. Explain the joint influence on
this event of (a) whether the model holds, (b) whether the sample size $n$ is large,
and (c) whether the number of cells $N$ is large.

16.24 Construct $X$ and $\theta$ in multinomial representation (16.22) for the independence
model for an $I \times J$ table. By contrast, show $X_a$ for the corresponding Poisson
loglinear model (16.24).

16.25 Using (16.13) and (16.23), derive the asymptotic $\mathrm{cov}(\hat\pi)$ for a multinomial loglinear
model.

16.26 Consider the ML estimator $\hat\pi_{ij} = p_{i+}p_{+j}$ of $\pi_{ij}$ for the independence model, when
that model does not hold. Show that $E(p_{i+}p_{+j}) = \pi_{i+}\pi_{+j}(n - 1)/n + \pi_{ij}/n$. To
what does $\hat\pi_{ij}$ converge as $n$ increases?

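16.27 Let $\zeta$ denote a generic measure of association. For $K$ independent multinomial
samples of sizes $\{n_k\}$, suppose that $\sqrt{n_k}(\hat\zeta_k - \zeta_k) \to_d N(0, \sigma_k^2)$ as $n_k \to \infty$. A
summary measure is

$$\bar\zeta = \frac{\sum_k n_k\hat\zeta_k/\hat\sigma_k^2}{\sum_k n_k/\hat\sigma_k^2}.$$

a. Show that $\sum_k z_k^2 = X^2 + [\bar\zeta^2/\hat\sigma^2(\bar\zeta)]$, where

$$z_k = n_k^{1/2}\hat\zeta_k/\hat\sigma_k, \qquad X^2 = \sum_k n_k(\hat\zeta_k - \bar\zeta)^2/\hat\sigma_k^2.$$

b. Suppose that $n \to \infty$ with $n_k/n \to \rho_k > 0$, $k = 1, \ldots, K$. State the asymptotic
chi-squared distribution for each component in the partitioning in (a). Indicate
the hypothesis that each tests.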


16.28 A $2 \times J$ table with fixed row totals consists of two independent multinomial
variates. Another $2 \times J$ table, with fixed column totals, consists of $J$ independent
binomial variates.

a. For testing that the multinomial distributions have identical parameters, show
that the null distribution of $\{n_{ij}\}$ given the sufficient statistics for the common
unknown parameters has the multivariate hypergeometric form.

b. For testing that the binomial distributions have identical parameters, show that
the null distribution of $\{n_{ij}\}$ given the sufficient statistic for the common unknown
parameter is the same as the one derived in (a).


16.29 A Monte Carlo scheme randomly samples $M$ separate $I \times J$ tables having the
observed margins to approximate $P_o = P(X^2 \geq X^2_o)$ for an exact test. Let $\hat{P}$ be the
sample proportion of the $M$ tables with $X^2 \geq X^2_o$. Show that $P(|\hat{P} - P_o| \leq B) =
1 - \alpha$ requires that $M \approx z^2_{\alpha/2}P_o(1 - P_o)/B^2$.
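
To get a feel for the magnitudes involved in Exercise 16.29, the approximation for M can simply be evaluated. This is a minimal sketch that evaluates the stated formula rather than deriving it; the function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def tables_needed(p0, bound, alpha):
    """Approximate number M of sampled tables so that the Monte Carlo estimate
    of the exact P-value p0 is within `bound` with probability 1 - alpha,
    using M ~ z_{alpha/2}^2 * p0 * (1 - p0) / bound^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile z_{alpha/2}
    return math.ceil(z ** 2 * p0 * (1 - p0) / bound ** 2)

# Hypothetical inputs: true P-value near 0.05, margin 0.01, 99% confidence.
print(tables_needed(0.05, 0.01, 0.01))
```

Because $P_o$ is unknown in practice, one conservative choice is to evaluate the formula at $P_o = 0.5$, where $P_o(1 - P_o)$ is largest.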

16.30 Exercise 1.26 showed LR and score confidence intervals for a binomial sample of
size $n$ with $y = 0$.

a. For the Clopper-Pearson approach, show that the upper bound is $1 - (\alpha/2)^{1/n}$,
or approximately $-\log(0.025)/n = 3.69/n$ when $\alpha = 0.05$.

b. For the adaptation of the Clopper-Pearson approach using the mid P-value,
show that the upper bound is $1 - \alpha^{1/n}$, or approximately $-\log(0.05)/n = 3/n$
when $\alpha = 0.05$.
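
The quality of the two approximations in Exercise 16.30 is easy to inspect numerically. The sketch below evaluates the exact bounds stated in the exercise next to 3.69/n and 3/n; the function names and the sample sizes are illustrative assumptions.

```python
def clopper_pearson_upper_zero(n, alpha=0.05):
    """Upper Clopper-Pearson bound when y = 0: solves (1 - pi)^n = alpha / 2."""
    return 1 - (alpha / 2) ** (1 / n)

def mid_p_upper_zero(n, alpha=0.05):
    """Upper bound from the mid P-value when y = 0: solves (1 - pi)^n / 2 = alpha / 2."""
    return 1 - alpha ** (1 / n)

for n in (10, 100, 1000):
    print(n,
          round(clopper_pearson_upper_zero(n), 5), round(3.69 / n, 5),
          round(mid_p_upper_zero(n), 5), round(3 / n, 5))
```

The agreement improves as n grows, since $1 - c^{1/n} \approx -\log(c)/n$ for large n.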


16.31 For a flip of a coin, let $\pi = P(\text{head})$. An experiment uses $n = 5$ independent
flips. Suppose that truly $\pi = 0.50$. Explain why the probability that the 95%
Clopper-Pearson confidence interval contains $\pi$ equals 1.0. [Hint: Is there any
possible $y$ for which both one-sided tests of $H_0$: $\pi = 0.50$ have P-value $\leq 0.025$?]

16.32 Consider the 95% score confidence interval for the binomial parameter $\pi$. When
$y = 1$, show that the lower limit is approximately $0.18/n$; in fact, $0 < \pi < 0.18/n$
then falls in an interval only when $y = 0$. Argue that for large $n$ and $\pi$ just barely
below $0.18/n$ or just barely above $1 - 0.18/n$, the coverage probability is about
$e^{-0.18} = 0.84$. Hence, even as $n \to \infty$, this method can have coverage probability
much less than 0.95 (Agresti and Coull 1998; Blyth and Still 1983).

16.33 Show that the conditional ML estimate of $\theta$ satisfies $n_{11} = E(n_{11})$ for distribution
(16.28).

16.34 For the geometric distribution $p(y) = \pi^y(1 - \pi)$, $y = 0, 1, 2, \ldots$, show that equating
$P(Y \geq y)$ and $P(Y \leq y)$ to $\alpha/2$ yields the confidence interval $[(\alpha/2)^{1/y}, (1 -
\alpha/2)^{1/(y+1)}]$. Show that all $\pi$ between 0 and $1 - \alpha/2$ never fall above a confidence
interval, and hence the actual coverage probability exceeds $1 - \alpha/2$ over
this region.

16.35 Consider marginal homogeneity for an $I \times I$ table.

a. Letting $F(\pi) = A\pi$, explain how (i) $F(\pi) = 0$, where $A$ has $I - 1$ rows, and
(ii) $F(\pi) = X\beta$, where $A$ has $2(I - 1)$ rows and $\beta$ has $I - 1$ elements. In part
(ii), show $A$, $\pi$, $X$, $\beta$ when $I = 3$.

b.  Explain  how  to  use  WLS  to  test  marginal  homogeneity.  [This  is  Bhapkar’s  test 
(11.15).] 

c.  Explain  why  the  minimum  modified  chi-squared  estimates  are  identical  to  WLS 
estimates. 

16.36 With WLS, show that $[F(p) - X\beta]^T\hat{V}_F^{-1}[F(p) - X\beta]$ is minimized by $b =
(X^T\hat{V}_F^{-1}X)^{-1}X^T\hat{V}_F^{-1}F(p)$.

16.37 The response functions $F(p)$ have asymptotic covariance matrix $V_F$. Derive the
asymptotic covariance matrices of the WLS model parameter estimator $b$ and the
predicted values $\hat{F} = Xb$.

16.38 Let $y_i$ be a bin$(n_i, \pi_i)$ variate for group $i$, $i = 1, \ldots, N$, with $\{y_i\}$ independent.
Consider the model that $\pi_1 = \cdots = \pi_N$. Denote the common value by $\pi$.

a. Show that the ML estimator of $\pi$ is $p = (\sum_i y_i)/(\sum_i n_i)$.

b. The minimum chi-squared estimator $\tilde\pi$ is the value of $\pi$ minimizing

$$\sum_{i=1}^{N} \frac{[(y_i/n_i) - \pi]^2}{\pi} + \sum_{i=1}^{N} \frac{[(y_i/n_i) - \pi]^2}{1 - \pi}.$$

The second term results from comparing $(1 - y_i/n_i)$ to $(1 - \pi)$, the proportions
in the second category. If $n_1 = \cdots = n_N = 1$, show that $\tilde\pi$ minimizes $Np(1 -
\pi)/\pi + N(1 - p)\pi/(1 - \pi)$. Hence show

$$\tilde\pi = p^{1/2}/[p^{1/2} + (1 - p)^{1/2}].$$

Note the considerable bias toward $\tfrac{1}{2}$ in this estimator.

c. As $N \to \infty$ with all $n_i = 1$, argue that the ML estimator is consistent but the
minimum chi-squared estimator is not (Mantel 1985).

d. For $N = 2$ groups, find the minimum modified chi-squared estimator of $\pi$.
Compare  it  to  the  ML  estimator. 
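
The bias noted in part (b) of Exercise 16.38 can be seen numerically. The following sketch, with hypothetical function names and a simple grid search chosen only for transparency, compares the closed-form minimizer with a direct minimization of the objective for the case in which every group size equals 1.

```python
import math

def objective(pi, p, n_groups=1):
    """N p (1 - pi)/pi + N (1 - p) pi/(1 - pi), the criterion when all n_i = 1."""
    return n_groups * p * (1 - pi) / pi + n_groups * (1 - p) * pi / (1 - pi)

def min_chisq_closed_form(p):
    """Closed-form minimizer sqrt(p) / [sqrt(p) + sqrt(1 - p)]."""
    return math.sqrt(p) / (math.sqrt(p) + math.sqrt(1 - p))

def min_chisq_grid(p, steps=10000):
    """Grid minimizer of the objective over (0, 1), as a numerical check."""
    grid = (i / steps for i in range(1, steps))
    return min(grid, key=lambda pi: objective(pi, p))

for p in (0.01, 0.1, 0.3, 0.5):
    print(p, round(min_chisq_closed_form(p), 4), round(min_chisq_grid(p), 4))
```

The ML estimate in this setting is p itself, so each printed row shows how far the minimum chi-squared estimate is pulled from p toward 1/2.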


CHAPTER  17 


Historical  Tour  of  Categorical 
Data  Analysis 


This  book  concludes  with  an  informal  historical  overview  of  the  evolution  of  methods  for 
categorical  data  analysis  (CDA).  We  have  seen  that  categorical  scales  are  pervasive  in  the 
social  sciences  and  the  biomedical  sciences.  Not  surprisingly,  the  development  of  GLMs 
for  categorical  responses  was  fostered  by  statisticians  having  ties  to  the  social  sciences  or 
to  the  biomedical  sciences. 

Only  in  the  last  quarter  of  the  twentieth  century  did  these  models  receive  the  attention 
given  early  in  the  century  to  models  for  continuous  data.  Regression  models  for  continuous 
variables evolved out of Francis Galton’s breakthroughs in the 1880s. The strong influence
of  R.  A.  Fisher,  G.  Udny  Yule,  and  other  statisticians  on  experimentation  in  agriculture  and 
biological  sciences  ensured  widespread  adoption  of  regression  and  ANOVA  modeling  by 
the  mid-twentieth  century.  On  the  other  hand,  despite  influential  articles  around  1900  by 
Karl  Pearson  and  Yule  on  association  between  categorical  variables,  models  for  categorical 
responses  received  scant  attention  until  the  1960s.  Stigler  (2002)  noted  that  even  simple 
two-way  contingency  tables  rarely  appeared  in  scientific  literature  before  1900,  and  the 
analyses  that  were  attempted  mainly  focused  on  summarizing  margins  or  reducing  the  data 
to  2x2  tables. 

The  beginnings  of  CDA  were  often  shrouded  in  controversy.  Key  figures  in  the  develop¬ 
ment  of  statistical  science  made  groundbreaking  contributions,  but  these  statisticians  were 
often  in  heated  disagreement  with  one  another. 


17.1  PEARSON-YULE  ASSOCIATION  CONTROVERSY 

Much  of  the  early  development  of  methods  for  CDA  took  place  in  England,  and  it  is  fitting 
that  we  begin  our  historical  tour  in  London  at  the  beginning  of  the  twentieth  century.  The 
year  1900  is  an  apt  starting  point,  since  in  that  year  Karl  Pearson  introduced  his  chi-squared 
statistic ($X^2$) and G. Udny Yule presented the odds ratio and related measures of association.
Before  then,  most  work  focused  on  descriptive  aspects  for  relatively  simple  measures.  For 
instance,  Goodman  and  Kruskal  (1959)  noted  that  the  Belgian  social  statistician  Adolphe 
Quetelet  used  the  relative  risk  in  1849. 

By 1900, Karl Pearson (1857-1936) was already well known in the statistics community.
He was head of a statistical laboratory at University College in London. His work
the previous decade included developing a large family of probability distributions (called
Pearson curves, they included important families with skew, such as the gamma), obtaining
the product-moment estimate of the correlation coefficient and finding its standard error,
and extending Galton’s work on linear regression. In fact, Pearson was a true renaissance
man, writing on a wide variety of topics that included art, religion, philosophy, law, socialism,
women’s rights, physics, genetics, eugenics, and evolution. Pearson’s motivation for
developing the chi-squared test included testing whether outcomes on a roulette wheel in
Monte Carlo varied randomly, checking the fit to various data sets of normal distributions
and Pearson curves, and testing statistical independence in two-way contingency tables.

Much  of  the  literature  on  CDA  early  in  the  twentieth  century  consisted  of  vocal  debates 
about appropriate ways to summarize association. Pearson’s approach assumed that continuous
bivariate distributions underlie two-way contingency tables (Pearson 1904, 1913).
He argued in favor of approximating a measure, such as the correlation, for the underlying
continuum. In 1904, Pearson introduced the term contingency as a “measure of the total
deviation of the classification from independent probability,” and he introduced measures
to describe its extent, such as the tetrachoric correlation (Section 2.4.8). The mean-square
contingency and the contingency coefficient are normalizations of $X^2$ to the (0, 1) scale. The
contingency coefficient (Exercise 3.32) for $I \times J$ tables standardized $X^2$ to approximate
an  underlying  correlation. 

George Udny Yule (1871-1951), a British contemporary of Pearson’s, took a different
approach. Having completed pioneering work developing multiple regression models and
multiple and partial correlation coefficients, Yule turned his attention between 1900 and
1912 to association in contingency tables. He believed that many categorical variables,
such as (vaccinated, unvaccinated) and (died, survived), are inherently discrete. Yule defined
indices directly using cell counts without assuming an underlying continuum. He
popularized the odds ratio $\theta$ [which Goodman (2000) noted may first have been proposed
by a Hungarian statistician, J. Körösy] and a transformation of it to the [-1, +1] scale,
$Q = (\theta - 1)/(\theta + 1)$, now called Yule’s Q (Exercise 2.38). Discussing one of Pearson’s
measures that assumes underlying normality, Yule argued (1912, p. 612) that “at best the
normal coefficient can only be said to give us in cases like these a hypothetical correlation
between supposititious variables. The introduction of needless and unverifiable hypotheses
does not appear to me a desirable proceeding in scientific work.” Yule (1903) also showed
the potential discrepancy between marginal and conditional associations in contingency
tables, later studied by E. H. Simpson (1951) and now called Simpson’s paradox.

In  the  first  quarter  of  the  twentieth  century,  Karl  Pearson  was  the  rarely  challenged  leader 
of  statistical  science  in  Britain.  Pearson’s  strong  personality  did  not  take  kindly  to  criticism, 
and  he  reacted  negatively  to  Yule’s  ideas,  arguing  that  Yule’s  measures  were  unsuitable. 
For instance, Pearson claimed that their values were unstable, since different collapsings of
$I \times J$ tables to $2 \times 2$ tables could produce quite different values of the measures. Pearson
and  D.  Heron  (1913)  filled  more  than  150  pages  of  Biometrika,  a  journal  he  co-founded 
and  edited,  with  a  scathing  reply  to  Yule’s  criticism.  In  a  passage  critical  also  of  Yule’s 
well-received book An Introduction to the Theory of Statistics, they stated “If Mr. Yule’s
views are accepted, irreparable damage will be done to the growth of modern statistical
theory .... [Yule’s Q] has never been and never will be used in any work done under his
[Pearson’s] supervision .... We regret having to draw attention to the manner in which
Mr. Yule has gone astray at every stage in his treatment of association, but criticism of
his methods has been thrust on us not only by Mr. Yule’s recent attack, but also by the
unthinking praise which has been bestowed on a text-book which at many points can only
lead statistical students hopelessly astray.” Pearson and Heron attacked Yule’s “half-baked
notions” and “specious reasoning” and argued that Yule would have to withdraw his ideas
“if he wishes to maintain any reputation as a statistician.”

In  retrospect,  Pearson  and  Yule  both  had  valid  points.  Some  classifications,  such  as 
most  nominal  variables,  have  no  apparent  underlying  continuous  distribution.  On  the  other 
hand, many applications relate naturally to an underlying continuum, and we’ve seen
that latent variable models can motivate many standard models and inferences (e.g.,
Sections 7.1.1 and 8.2.3). Goodman (1981a,b) noted that the ordinal models presented in
Sections  10.4.1  and  10.5.2  provide  a  sort  of  reconciliation  between  Yule  and  Pearson, 
since  Yule’s  odds  ratio  characterizes  models  that  fit  well  when  underlying  distributions  are 
approximately  normal. 

Half a century after the Pearson-Yule controversy, Leo Goodman and William Kruskal
surveyed the development of association measures for contingency tables and made many
contributions of their own. Their 1979 book reprinted four influential articles of theirs from
the Journal of the American Statistical Association on this topic. Initial development of
many measures occurred in the nineteenth century. Their 1959 article contains the following
quote  from  M.  H.  Doolittle  in  1887,  which  illustrates  the  lack  of  precision  in  early  attempts 
to  quantify  the  meaning  of  association  even  in  2  x  2  tables:  “Having  given  the  number  of 
instances  respectively  in  which  things  are  both  thus  and  so,  in  which  they  are  thus  but  not 
so,  in  which  they  are  so  but  not  thus,  and  in  which  they  are  neither  thus  nor  so,  it  is  required 
to  eliminate  the  general  quantitative  relativity  inhering  in  the  mere  thingness  of  the  things, 
and  to  determine  the  special  quantitative  relativity  subsisting  between  the  thusness  and  the 
soness  of  the  things.”  Goodman  (2000)  added  to  the  historical  survey  and  proposed  a  new 
measure. 


17.2  R.  A.  FISHER’S  CONTRIBUTIONS 

Pearson’s  disagreements  with  Yule  were  minor  compared  with  his  later  ones  with  Ronald 
A.  Fisher  (1890-1962).  Using  a  geometric  representation,  Fisher  (1922)  introduced  degrees 
of  freedom  to  characterize  the  family  of  chi-squared  distributions.  Fisher  claimed  that  for 
tests of independence in I x J tables, X² has df = (I − 1)(J − 1). By contrast, Pearson
(1900, 1904) had argued that for any application of X², the index that Fisher later identified
as df equals the number of cells minus 1, or IJ − 1 for two-way tables. Fisher pointed out,
however, that estimating hypothesized cell probabilities using estimated row and column
probabilities resulted in an additional (I − 1) + (J − 1) constraints on the fitted values,
thus affecting the distribution of X².

Not  surprisingly,  Pearson  (1922)  reacted  critically  to  Fisher’s  suggestion  that  his  df 
formula  was  incorrect.  He  stated:  “I  hold  that  such  a  view  [Fisher’s]  is  entirely  erroneous, 
and that the writer has done no service to the science of statistics by giving it broad-cast
circulation in the pages of the Journal of the Royal Statistical Society. ... I trust my critic
will pardon me for comparing him with Don Quixote tilting at the windmill; he must either
destroy  himself,  or  the  whole  theory  of  probable  errors,  for  they  are  invariably  based  on 
using  sample  values  for  those  of  the  sampled  population  unknown  to  us.”  Pearson  claimed 
that  using  row  and  column  sample  proportions  to  estimate  unknown  probabilities  had 
negligible  effect  on  large-sample  distributions,  although  he  had  realized  (Pearson  1917) 
that df must be adjusted when the cell counts have linear constraints. Fisher was unable to
get  his  rebuttal  published  by  the  Royal  Statistical  Society,  and  he  ultimately  resigned  his 
membership. 

Statisticians  soon  realized  that  Fisher  was  correct,  but  he  maintained  much  bitterness 
over  this  and  other  dealings  with  Pearson.  In  the  preface  to  a  later  volume  of  his  collected 
works,  he  remarked  that  his  1922  article  “had  to  find  its  way  to  publication  past  critics  who, 
in  the  first  place,  could  not  believe  that  Pearson’s  work  stood  in  need  of  correction,  and  who, 
if  this  had  to  be  admitted,  were  sure  that  they  themselves  had  corrected  it.”  Writing  about 
Pearson,  he  stated:  “If  peevish  intolerance  of  free  opinion  in  others  is  a  sign  of  senility,  it  is 
one which he had developed at an early age.” In Fisher (1926), he was able to dig the knife a
bit deeper into the Pearson family using 11,688 2 x 2 tables randomly generated assuming
independence by Karl Pearson’s son, E. S. Pearson. Fisher showed that the sample mean
of X² for these tables was 1.00001, much closer to the 1.0 predicted by his formula for
E(X²) of df = (I − 1)(J − 1) = 1 than Pearson’s IJ − 1 = 3. His daughter, Joan Fisher
Box  (1978),  discussed  this  and  other  conflicts  between  Fisher  and  Pearson.  See  Stigler 
(2008)  for  a  discussion  of  the  error  in  Pearson’s  argument,  and  see  Fienberg  (1980),  Hald 
(1998,  pp.  652-663),  Plackett  (1983),  and  Stigler  (1999,  Chap.  19)  for  more  about  this 
controversy. 
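
The df disagreement is easy to examine by simulation. The following is a minimal sketch, not a reconstruction of E. S. Pearson's tables or of Fisher's calculation: it draws 2 x 2 tables under independence and averages Pearson's statistic, with row and column proportions estimated from each table. The function names, the sample size of 200 per table, and the marginal probabilities are all assumptions chosen for illustration.

```python
import numpy as np


def pearson_x2(table):
    """Pearson chi-squared statistic for independence in a two-way table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Fitted values use the estimated row and column proportions.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum())


def simulate_mean_x2(n_tables=5000, n=200, row_p=(0.5, 0.5),
                     col_p=(0.4, 0.6), seed=0):
    """Average X^2 over tables generated under independence (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    cell_probs = np.outer(row_p, col_p).ravel()
    counts = rng.multinomial(n, cell_probs, size=n_tables)
    shape = (len(row_p), len(col_p))
    return float(np.mean([pearson_x2(c.reshape(shape)) for c in counts]))


mean_x2 = simulate_mean_x2()
print(f"simulated mean of X^2: {mean_x2:.3f}")
```

Under these assumed settings the simulated mean should fall close to (I − 1)(J − 1) = 1 rather than IJ − 1 = 3, which is the pattern Fisher reported.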

Fisher’s preeminent reputation among statisticians today accrues mainly from his theoretical
work (introducing concepts such as sufficiency, information, and optimal properties
of  ML  estimators)  and  his  methodological  contributions  to  the  design  of  experiments  and 
the  analysis  of  variance.  Although  not  so  well  known  for  work  in  CDA,  he  made  other 
interesting  contributions.  Moreover,  he  made  good  use  of  the  methods  in  his  applied  work. 
For instance, Fisher was also a famed geneticist. In one article, he used Pearson’s goodness-of-fit
test to check Mendel’s theories of natural inheritance and showed that the fit was too
good  (Section  1.5.4). 

Fisher  realized  the  limitations  of  large-sample  methods  for  laboratory  work,  and  he 
was at the forefront of advocating specialized small-sample methods. Writing about large-sample
methods in the preface to the first edition of his classic text Statistical Methods
for Research Workers, he stated: “[T]he traditional machinery of statistical processes is
wholly  unsuited  to  the  needs  of  practical  research.  Not  only  does  it  take  a  cannon  to  shoot  a 
sparrow,  but  it  misses  the  sparrow!  The  elaborate  mechanism  built  on  the  theory  of  infinitely 
large  samples  is  not  accurate  enough  for  simple  laboratory  data.  Only  by  systematically 
tackling  small  sample  problems  on  their  merits  does  it  seem  possible  to  apply  accurate 
tests  to  practical  data.”  Fisher  was  among  the  first  to  promote  the  work  by  W.  S.  Gosset 
(pseudonym  “Student”)  on  the  t  distribution.  The  fifth  edition  of  Statistical  Methods  for 
Research  Workers  (1934)  introduced  Fisher’s  exact  test  for  2  x  2  contingency  tables.  In 
his  1935  book  The  Design  of  Experiments,  Fisher  described  the  tea-tasting  experiment 
(Section  3.5.2)  motivated  by  his  experience  at  an  afternoon  tea  break  while  employed  at 
Rothamsted  Experimental  Station. 

The  mid-1930s  finally  saw  more  attention  to  model  building  for  categorical  responses. 
Chester  Bliss  (1935),  following  up  a  1933  report  on  quantal  response  methods  by  J.  H. 
Gaddum,  popularized  the  probit  model  for  applications  in  toxicology  with  a  binary  response. 
Bliss  introduced  the  term  probit  but  used  the  inverse  normal  cdf  with  mean  5  (rather  than 
0, in order to avoid negative values) and standard deviation 1. Stigler (1986, p. 246) and
Finney (1971) attributed the first use of inverse normal cdf transformations of proportions
to the German physicist Gustav Fechner in his 1860 book Elemente der Psychophysik. See
Finney (1971) and McCulloch (2000) for other history of the probit method and Chapter 9
of Cramer (2011) for a survey of the early origins of binary regression models.

In  the  appendix  of  Bliss  (1935),  Fisher  (1935b)  outlined  an  algorithm  for  finding  ML 
estimates  of  model  parameters.  That  algorithm  was  a  Newton-Raphson  type  of  method 
using  expected  information,  today  commonly  called  Fisher  scoring  (Section  4.6.2).  Fisher 
(1954)  argued  for  using  ML  for  binary  models  with  an  appropriate  link  function  in  place  of 
the  popular  approach  at  the  time  of  applying  a  variance-stabilizing  transformation  in  order 
to  use  ordinary  least-squares  methods. 
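
To make the Fisher scoring idea concrete, here is a minimal sketch for a probit model with grouped binomial data. It is not Fisher's 1935 presentation: the dose levels and counts are hypothetical, the helper names are assumptions, and the linear predictor uses mean 0 rather than Bliss's shifted mean of 5. Each iteration solves (expected information) x step = score and adds the step to the current estimate.

```python
import math

import numpy as np


def norm_cdf(z):
    """Standard normal cdf, computed from math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)


def score_and_information(beta, X, successes, trials):
    """Score vector and expected information for the binomial probit model."""
    eta = X @ beta
    mu = norm_cdf(eta)
    dens = norm_pdf(eta)
    var = mu * (1.0 - mu)
    score = X.T @ (dens * (successes - trials * mu) / var)
    weights = trials * dens ** 2 / var
    info = X.T @ (X * weights[:, None])
    return score, info


def probit_fisher_scoring(X, successes, trials, tol=1e-10, max_iter=50):
    """Iterate beta <- beta + info^{-1} score until the step is negligible."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score, info = score_and_information(beta, X, successes, trials)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Hypothetical dose-response data: 50 subjects at each of five log-dose levels.
doses = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(doses), doses])
trials = np.full(5, 50.0)
successes = np.array([3.0, 10.0, 24.0, 41.0, 48.0])

beta_hat = probit_fisher_scoring(X, successes, trials)
print("intercept and slope estimates:", beta_hat)
```

Because the probit link is not canonical, the expected information used here differs from the observed information that Newton-Raphson would use; for the logit link the two coincide.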

The definition for homogeneous association (no interaction) in contingency tables originated
in an article by the British statistician Maurice Bartlett (1935) about 2 x 2 x 2 tables.
Bartlett  showed  how  to  solve  a  cubic  equation  to  find  ML  estimates  of  cell  probabilities 
satisfying  the  property  of  equality  of  odds  ratios  between  two  variables  at  each  level  of 
the  third.  He  attributed  the  idea  to  Fisher.  In  the  same  year,  Sam  Wilks  proposed  the 
likelihood-ratio  test  of  independence  in  contingency  tables. 

In  1940,  Fisher  developed  canonical  correlation  methods  for  contingency  tables.  He 
showed  how  to  assign  scores  to  rows  and  columns  of  a  contingency  table  to  maximize 
the correlation. His work relates to the later development, particularly in France, of correspondence
analysis methods (e.g., Benzécri 1973). In the same year, Deming and Stephan
showed  how  to  apply  iterative  proportional  fitting  (IPF)  for  raking  contingency  tables  to 
maintain  associations  while  satisfying  fixed  marginal  distributions. 
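
The raking idea can be sketched in a few lines. This is an illustration under assumed inputs rather than Deming and Stephan's original presentation: the function name, the 2 x 2 table, and the target margins are hypothetical. Rows and columns are rescaled in turn until both sets of margins match their targets, and the odds ratio of the starting table is left unchanged.

```python
import numpy as np


def ipf_rake(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rake a two-way table of positive counts to the target margins."""
    fitted = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row to match its target total.
        fitted *= (row_targets / fitted.sum(axis=1))[:, None]
        # Scale each column to match its target total.
        fitted *= (col_targets / fitted.sum(axis=0))[None, :]
        # Columns now match exactly; stop once rows also match.
        if np.max(np.abs(fitted.sum(axis=1) - row_targets)) < tol:
            break
    return fitted


def odds_ratio(t):
    """Odds ratio of a 2 x 2 table."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])


observed = np.array([[10.0, 20.0], [30.0, 40.0]])
raked = ipf_rake(observed, row_targets=[50.0, 50.0], col_targets=[40.0, 60.0])
print(raked)
print("odds ratio before and after:", odds_ratio(observed), odds_ratio(raked))
```

Multiplying a row or a column by a constant leaves every odds ratio unchanged, which is why the association structure of the starting table survives the adjustment.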

R.  A.  Fisher  has  been  the  greatest  influence  on  the  practice  of  modern  statistical  science. 
The  biography  by  his  daughter  (Box  1978)  gives  a  fascinating  account  of  his  impressive 
contributions  to  statistics  and  genetics.  Fienberg  (1980)  summarized  his  contributions  to 
CDA. 


17.3  LOGISTIC  REGRESSION 

The  logit  transform  showed  up  sporadically  before  its  use  in  binomial  logistic  regression. 
For example, Bartlett (1937) used log[y/(1 − y)] in regression and ANOVA to transform
observations y that are continuous proportions (Exercise 4.35), and in a book of statistical
tables published in 1938, R. A. Fisher and Frank Yates suggested it as a possible transformation
of a binomial parameter for analyzing binary data. In 1944, the physician and
statistician Joseph Berkson introduced the term logit for this transformation, and following
Wilson and Worcester (1943), who had employed it for estimating LD50, proposed the logistic
regression model. Berkson showed that the logistic model fitted similarly to the probit
model,  and  his  subsequent  work  did  much  to  popularize  the  model.  He  argued,  however, 
for  fitting  using  the  computationally  simpler  minimum  logit  chi-squared  rather  than  ML. 
[See  also  his  comments  following  Fisher  (1954)  in  this  regard.]  In  1951,  Jerome  Cornfield, 
another  statistician  with  strong  medical  ties,  used  the  odds  ratio  to  approximate  relative 
risks  in  case-control  studies.  Dyke  and  Patterson  (1952)  apparently  first  used  the  logit  in 
models  with  qualitative  predictors. 

Sir  David  Cox  introduced  many  statisticians  to  logistic  regression,  through  his  influential 
1958 article and 1970 book, The Analysis of Binary Data. About the same time, an article
by  the  Danish  statistician  and  mathematician  Georg  Rasch  sparked  an  enormous  literature 
on  item-response  models.  The  most  important  of  these  is  the  logit  model  with  subject 
and  item  parameters,  now  called  the  Rasch  model  (Section  13.1.4).  This  work  was  highly 
influential in the psychometric community of northern Europe (especially in Denmark, the
Netherlands,  and  Germany)  and  spurred  many  generalizations  in  the  educational  testing 
community  in  the  United  States. 

The  extension  of  logistic  regression  to  multicategory  responses  received  occasional 
attention before 1970 (e.g., Mantel 1966) but substantial work after about that date. For
nominal  responses,  early  work  was  mainly  in  the  econometrics  literature.  See  Bock  (1970), 
McFadden  (1974),  and  Theil  (1969,  1970).  In  2000,  Daniel  McFadden  won  the  Prize  in 
Economic  Sciences  in  Memory  of  Alfred  Nobel  for  his  work  in  the  1970s  and  1980s  on 
the  discrete-choice  model  (Section  8.5).  For  cumulative  logit  models  for  ordinal  responses, 
see  Bock  and  Jones  (1968),  Simon  (1974),  Snell  (1964),  Walker  and  Duncan  (1967),  and 
Williams  and  Grizzle  (1972).  The  cumulative  probit  case,  shown  to  result  from  a  normal 
latent  variable  model  (McKelvey  and  Zavoina  1975),  has  a  longer  history;  see,  for  instance, 
Aitchison  and  Silvey  (1957)  and  Bock  and  Jones  (1968,  Chap.  8).  Cumulative  logit  and 
probit  models  received  much  more  attention  following  publication  of  McCullagh  (1980), 
which  provided  a  Fisher  scoring  algorithm  for  ML  fitting  of  all  cumulative  link  models. 

The  next  major  advances  with  logistic  regression  dealt  with  its  application  to  case-control 
studies  (e.g.,  Breslow  1996,  Mantel  1973,  Prentice  1976a,  Prentice  and  Pyke  1979;  see  also 
Section  5.1.4)  and  the  conditional  ML  approach  to  model  fitting  for  those  studies  and 
others  with  numerous  nuisance  parameters  (Breslow  et  al.  1978,  with  related  work  cited  in 
Note  7.5).  The  conditional  approach  was  later  exploited  in  small-sample  exact  inference 
(Hirji et al. 1987, Mehta and Patel 1995). See also Sections 7.3, 11.2, 16.5, and 16.6.

Nathan  Mantel,  whose  name  appears  in  the  preceding  two  paragraphs,  made  a  variety  of 
interesting  contributions  to  CDA.  Although  best  known  for  the  1959  Mantel-Haenszel  test 
and  related  odds  ratio  estimator,  he  also  discussed  trend  tests  (1963),  multinomial  logit  and 
loglinear  modeling  (1966),  logistic  regression  for  case-control  data  (1973),  the  number  of 
contingency  tables  having  fixed  margins  (Gail  and  Mantel  1977),  the  analysis  of  square 
contingency  tables  (Mantel  and  Byar  1978),  and  problems  with  minimum  chi-squared  and 
Wald  tests  (1985,  1987a). 

Logistic  regression  has  become  a  useful  component  of  causal  inference  methods  and 
methods  of  dealing  with  missing  data.  An  example  is  the  introduction  by  Rosenbaum  and 
Rubin (1983) of the propensity score for modeling the probability of being in some treatment
group, as a device for adjusting for bias in treatment comparisons with observational
studies. 

More  recently,  attention  has  focused  on  fitting  logistic  models  to  correlated  responses 
for  clustered  data.  One  strand  of  this  is  marginal  modeling  of  longitudinal  data  (Liang 
and  Zeger  1986,  Liang  et  al.  1992,  Lipsitz  et  al.  1994).  Much  of  this  literature  focuses  on 
quasi-likelihood  methods  such  as  generalized  estimating  equations  (GEEs).  Another  strand 
is generalized linear mixed models (e.g., Breslow and Clayton 1993, Pierce and Sands
1975). 

Perhaps the most far-reaching contribution of the past half century has been the introduction
by British statisticians John Nelder and R. W. M. Wedderburn in 1972 of the concept
of  generalized  linear  models.  This  unifies  the  logistic  and  probit  regression  models  for 
binomial  data  with  loglinear  models  for  Poisson  data  and  with  long-established  regression 
and  ANOVA  models  for  normal-response  data.  Interestingly,  the  algorithm  they  used  to  fit 
GLMs  is  Fisher  scoring,  which  R.  A.  Fisher  introduced  in  1935  for  ML  fitting  of  probit 
models.  McCulloch  (2000)  reviewed  the  journey  from  probit  models  to  GLMs  and  their 
further  generalizations  such  as  quasi-likelihood. 


17.4 MULTIWAY CONTINGENCY TABLES AND LOGLINEAR MODELS

The quarter century following the end of World War II saw the development of a theoretical
underpinning for models for contingency tables. H. Cramér (1946) and C. R. Rao (1957,
1963)  derived  general  expressions  for  large-sample  distributions  of  parameter  estimators. 

In 1949, the Berkeley-based statistician Jerzy Neyman, who had already performed fundamental
work on hypothesis testing and interval estimation methods with E. S. Pearson,
introduced the family of best asymptotically normal (BAN) estimators. These have the
same optimal large-sample properties as ML estimators. The BAN family includes estimators
obtained by minimizing chi-squared-type measures comparing observed proportions to
proportions predicted by the model (Section 16.7.4). This type of estimator itself includes
some weighted least squares (WLS) estimators. The simplicity of their computation, compared
with ML estimators, was an important consideration before the advent of modern
computing. Neyman’s (1949) only mention of Fisher was the suggestion that Fisher did not
realize that estimators other than ML could be BAN, stating that “the results . . . contradict
the assertion of R. A. Fisher, not a very clear one, that ‘the maximum likelihood equation
may indeed be derived from the conditions that it shall be linear in frequencies, and efficient
for all values of θ’.” In fact, Fisher had realized in the 1920s that other estimators could
be efficient (see Stigler’s “The epic story of maximum likelihood” 2007 article in Statist.
Sci.),  and  he  often  returned  the  jab  at  Neyman,  such  as  in  writing  (1956)  about  proposals 
for  an  unconditional  test  for  2  x  2  tables,  “the  principles  of  Neyman  and  Pearson’s  ‘Theory 
of  Testing  Hypotheses’  are  liable  to  mislead  those  who  follow  them  into  much  wasted 
effort.” 

In  the  early  1950s,  William  Cochran  published  work  dealing  with  a  variety  of  important 
topics  in  CDA.  Scottish-born,  Cochran  spent  most  of  his  career  at  American  universities: 
Iowa State, North Carolina State, Johns Hopkins, and Harvard. Cochran (1940) modeled
Poisson and binomial responses with variance-stabilizing transformations. He (1943)
recognized and discussed ways of dealing with overdispersion. His comments following
Fisher (1954) also dealt with practical issues, such as dangers of relying solely on ML
methods  when  extra  heterogeneity  existed  beyond  what  standard  distributions  assume. 
Cochran  (1950)  introduced  a  generalization  (Cochran’s  Q)  of  McNemar’s  test  for  com¬ 
paring  proportions  in  several  matched  samples.  His  classic  1954  article  is  a  mixture  of 
new  methodology  and  advice  for  applied  statisticians.  It  gave  sample-size  guidelines  for 
chi-squared approximations to work well for the X² statistic, pointing out that the guideline
that expected frequencies should exceed 5 was often too strict. It also stressed the
importance of directing inferences toward narrow (e.g., single-degree-of-freedom) alternatives
and partitioning chi-squared statistics into components. One instance of this was
Cochran’s proposed test of conditional independence in several 2 x 2 tables, which was
closely related to the Mantel and Haenszel (1959) test (Section 6.4.2). Another was a test
for a linear trend in proportions across quantitatively defined rows of an I x 2 table (Section
5.3.5). See also Cochran (1955). Fienberg (1984) reviewed Cochran’s contributions
to  CDA. 

Bartlett’s  work  on  interaction  structure  in  2  x  2  x  2  contingency  tables  had  relatively 
little impact for 20 years. Indeed, in presenting methods for partitioning X² in 2 x 2 x 2
tables,  Lancaster  (1951)  noted  that  “Doubtless  little  use  will  ever  be  made  of  more  than 
a  three-dimensional  classification.”  However,  in  the  mid-1950s  and  early  1960s,  Bartlett’s 
work  was  extended  in  many  ways  to  multiway  tables.  See,  for  instance,  Darroch  (1962), 
Good (1963), Goodman (1964b), Plackett (1962), Roy and Kastenbaum (1956), and Roy
and  Mitra  (1956).  These  articles  as  well  as  influential  articles  by  Martin  W.  Birch  (1963, 
1964a,b,  1965)  were  the  genesis  of  research  work  on  loglinear  models  between  about  1965 
and  1975.  Birch’s  work  was  part  of  a  never-submitted  Ph.D.  thesis  at  the  University  of 
Glasgow.  He  explicitly  provided  loglinear  model  formulas  and  explained  analogies  with 
factorial  ANOVA  models,  and  he  showed  how  to  obtain  ML  estimates  of  cell  probabilities 
in  three-way  tables  for  various  models.  He  showed  the  equivalence  of  those  ML  estimates 
for Poisson and multinomial sampling. He (and Watson 1959) extended theoretical results
of Cramér and Rao on large-sample distributions for contingency table models.

A  survey  article  by  the  French  statistician  Henri  Caussinus  (1966),  based  partly  on 
his  Ph.D.  thesis,  provides  a  good  glimpse  of  the  state-of-the-art  of  CDA  in  the  middle  of 
these  two  decades  of  advances.  There,  Caussinus  introduced  the  quasi-symmetry  model 
for  square  tables.  Issue  number  4  in  Volume  XI  of  Annales  de  la  Faculte  des  Sciences 
de  Toulouse  Mathematiques,  a  special  2002  issue  honoring  Caussinus  at  his  retirement, 
contains  remembrances  by  Caussinus  about  the  origins  of  this  contribution  as  well  as 
several  articles  investigating  this  property  and  its  links  and  extensions. 

Much  of  the  work  in  the  next  decades  on  loglinear  and  related  logit  modeling  took  place 
at  three  American  universities:  the  University  of  Chicago,  Harvard  University,  and  the 
University of North Carolina. At Chicago, Leo Goodman wrote a series of groundbreaking
articles, dealing with such topics as partitionings of chi-squared, models for square
tables  (e.g.,  quasi-independence),  stepwise  logit  and  loglinear  model-building  procedures, 
deriving  asymptotic  variances  of  ML  estimates  of  loglinear  parameters,  latent  class  models 
(building  on  early  work  by  Paul  Lazarsfeld),  association  models,  correlation  models,  and 
correspondence analysis. For surveys of his early work, see Goodman (1968, an R. A. Fisher
memorial  lecture,  1970).  For  later  work,  see  Goodman  (1985,  1996,  2000).  Goodman  also 
wrote a stream of articles for social science journals that had a substantial impact on popularizing
loglinear and logit methods for applications (e.g., Goodman 2007 and references
therein).  (See  Figure  17.1.) 

Over the past 60 years, Goodman has been the most prolific contributor to the advancement
of CDA methodology. The field owes tremendous gratitude to his steady and
impressive  body  of  work.  In  addition,  some  of  Goodman’s  students  at  Chicago  also  made 
fundamental  contributions.  In  1970,  Shelby  Haberman  completed  a  Ph.D.  dissertation  (the 
basis  of  his  1974a  monograph)  making  substantial  theoretical  contributions  to  loglinear 
modeling.  Among  topics  he  considered  were  residual  analyses,  existence  of  ML  estimates, 
loglinear  models  for  ordinal  variables,  and  theoretical  results  for  models  (such  as  the  Rasch 
model)  for  which  the  number  of  parameters  grows  with  the  sample  size.  Clifford  Clogg 
followed  in  Goodman’s  footsteps  by  having  influence  in  the  social  sciences  and  in  statistics 
with  his  work  on  association  models,  demography,  models  for  rates,  the  census,  and  various 
other  topics. 

Simultaneously with Goodman’s work, related research on ML methods for loglinear-logit
models occurred at Harvard by students of Frederick Mosteller (such as Stephen
Fienberg)  and  William  Cochran.  Much  of  this  research  was  inspired  by  problems  arising 
in  analyzing  large,  multivariate  data  sets  in  the  National  Halothane  Study  (see  Chap.  5 
in Mosteller’s 2010 autobiography, The Pleasures of Statistics). That study investigated
whether  halothane  was  more  likely  than  other  anesthetics  to  cause  death  due  to  liver 
damage. A presidential address by Mosteller (1968) to the American Statistical Association
described  early  uses  of  loglinear  models  for  smoothing  multidimensional  discrete  data  sets. 

Figure 17.1 Four leading figures in the development of categorical data analysis: Karl Pearson, G. Udny Yule, Ronald A. Fisher, and Leo Goodman.


Yvonne  Bishop  (1969)  noted  the  equivalence  between  loglinear  and  logit  models  and 
showed the usefulness of IPF for model fitting. Fienberg (1970a,b) dealt with theoretical
aspects of IPF as well as existence of ML estimates for square-table models. For the past
40 years, he has been one of the most prolific researchers in loglinear modeling, together
with many of his Ph.D. students. A landmark book in 1975 by Bishop and Fienberg with Paul
Holland, Discrete Multivariate Analysis, was largely responsible for introducing loglinear
models to the general statistical community and remains an important reference.

Figure 17.2 Alan Agresti with statisticians from the North Carolina (Peter Imrey, Gary Koch, J. Richard Landis)
and Harvard (Stephen Fienberg) schools of CDA. Photo taken by Bahjat Qaqish at 2009 UNC Festschrift for Gary
Koch.

Research  at  North  Carolina  by  Gary  Koch  and  colleagues  and  many  of  his  Ph.D.  students 
(such  as  J.  Richard  Landis,  Peter  Imrey,  and  Maura  Stokes)  has  been  highly  influential  in  the 
biomedical  sciences.  Their  research  developed  WLS  methods  for  categorical  data  models 
(Section 16.7.1). The 1969 article by Koch with his colleagues J. Grizzle and F. Starmer
popularized this approach. Koch and colleagues extended it in later articles to an impressive
variety of problems, including problems for which ML methods are awkward to use, such
as the analysis of repeated categorical measurement data (Koch et al. 1977). In 1966, Vasant
Bhapkar showed that the WLS estimator is often identical to Neyman’s minimum modified
chi-squared estimator. Imrey (2011) surveyed Koch’s contributions, and Fienberg (2011)
related the UNC work to related developments elsewhere. (See Figure 17.2.)

The early literature on loglinear models treated all classifications as nominal. Haberman
(1974b) and Simon (1974) showed how to exploit ordinality of classifications in loglinear
models. This work was extended in several articles by Leo Goodman (1979a, 1981a,b,
1983). The extensions included association models, which can replace ordered scores in
loglinear models by parameters (Section 10.5). Goodman (1985, 1986, 1996) also discussed
related correlation models and provided a model-based perspective for essentially equivalent
correspondence analysis methods. Joseph Lang (1996a, 2004, 2005) has extended ML fitting
to  broad  classes  of  models,  including  generalized  loglinear  models. 

Certain  loglinear  models  with  conditional  independence  structure  provide  graphical 
models  for  contingency  tables  (Section  10.1.2).  The  article  by  Darroch  et  al.  (1980)  was 
the  genesis  of  much  of  this  work. 

Fienberg and Rinaldo (2007) provided a historical overview of the development of
loglinear models. That article also discusses issues still to be resolved adequately, such as
whether ML estimates exist for large, sparse contingency tables containing many sampling
zeros.


17.5 BAYESIAN METHODS FOR CATEGORICAL DATA

We  now  summarize  the  development  of  Bayesian  methods  for  categorical  data  analysis. 
This  actually  dates  back  250  years,  as  Bayes  in  1763  (and  then  Laplace  in  1774)  estimated 
a  binomial  parameter  using  a  uniform  prior  distribution.  See  Stigler  (1986,  pp.  100-136) 
for  details. 
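
As a small worked illustration of the uniform-prior calculation (a sketch with hypothetical function names, not code from any source cited here): a uniform prior is the Beta(1, 1) distribution, so after y successes in n trials the posterior is Beta(y + 1, n − y + 1), with mean (y + 1)/(n + 2).

```python
from fractions import Fraction


def uniform_prior_posterior(successes, trials):
    """Beta posterior parameters for a binomial parameter under a uniform prior."""
    return successes + 1, trials - successes + 1


def posterior_mean(successes, trials):
    """Posterior mean (y + 1) / (n + 2), returned as an exact fraction."""
    alpha, beta = uniform_prior_posterior(successes, trials)
    return Fraction(alpha, alpha + beta)


# Hypothetical data: no successes in 10 trials, then 7 successes in 10 trials.
print(posterior_mean(0, 10))
print(posterior_mean(7, 10))
```

With no successes in 10 trials the posterior mean is 1/12 rather than the sample proportion 0, a simple instance of shrinking an estimate away from the boundary.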

Early  applications  of  Bayesian  methods  to  contingency  tables  involved  smoothing  cell 
counts  to  improve  estimation  of  cell  probabilities.  In  particular,  large  sparse  tables  often 
contain  many  sampling  zeros,  for  which  0.0  is  unappealing  as  a  probability  estimate.  Also, 
Stein’s  results  for  estimating  multivariate  normal  means  suggest  that  lower  total  mean 
squared  error  occurs  with  Bayes  estimators  that  shrink  the  sample  proportions  toward  some 
average  value  (Efron  and  Morris  1975). 

I. J. Good (1956) used log-normal and gamma priors in estimating association factors
in  contingency  tables  (Section  2.4.2).  Good’s  (1965)  monograph  summarized  the  use  of 
Bayesian  methods  for  estimating  multinomial  probabilities  in  contingency  tables,  using 
a  Dirichlet  prior  distribution.  Good  (1967)  focused  on  suitable  priors  for  multinomial 
probabilities  in  significance  tests  and  made  considerable  efforts  to  reconcile  Bayesian 
and  frequentist  inference,  such  as  by  relating  Bayes  factors  to  chi-squared  statistics.  He 
was  innovative  in  his  early  use  of  hierarchical  and  empirical  Bayesian  approaches.  Much 
of  Good’s  early  work  was  apparently  motivated  by  his  collaborations  with  Alan  Turing  at 
Bletchley Park, England, during World War II in work toward breaking the German code for
its  wartime  communications.  Albert  (2010)  reviewed  Good’s  early  research  as  well  as  later 
related  work  on  smoothing  contingency  tables.  By  contrast,  early  critics  of  the  Bayesian 
approach  included  R.  A.  Fisher  (1956),  who  challenged  the  use  of  a  uniform  prior  for  the 
binomial  parameter,  noting  that  uniform  priors  on  other  scales  would  lead  to  different  results. 

For  2x2  tables,  Altham  (1969)  gave  a  Bayesian  analysis  comparing  parameters  for 
two  independent  binomial  samples,  using  independent  beta  priors  (Section  3.6.2).  Seneta 
and  Phipps  (2001)  noted  that  a  Swiss  medical  doctor,  Carl  Liebermeister,  had  suggested 
such  an  approach  with  uniform  priors  in  1877.  Altham  (1971)  showed  Bayesian  analyses 
for  binomial  proportions  from  matched-pairs  data.  The  Bayesian  approaches  presented  by 
then  focused  directly  on  cell  probabilities  by  using  a  prior  distribution  for  them.  In  an 
influential  article,  Lindley  (1964)  focused  on  estimating  summary  measures  of  association 
in  contingency  tables.  For  instance,  using  a  Dirichlet  prior  distribution  for  the  multinomial 
probabilities,  he  found  the  posterior  distribution  of  contrasts  of  log  probabilities,  such  as  the 
log  odds  ratio.  An  alternative  approach  (Leonard  1975,  Laird  1978)  focused  on  parameters 
of  the  saturated  loglinear  model,  using  normal  priors.  The  approach  of  using  normal  priors 
for  logits  received  considerable  attention  in  the  1970s  by  Leonard  and  others  (e.g.,  Leonard 
1972). 

In  the  context  of  model  selection  for  analyzing  contingency  tables,  Raftery  (1986) 
proposed  replacing  P-values  by  Bayes  factors.  He  suggested  BIC  as  a  simple  approximation 
to 2[log(Bayes factor)]. Since then, there has been an enormous literature on issues of
model  selection  including  model  averaging  (e.g.,  Madigan  and  Raftery  1994).  BIC  itself 
has  become  an  increasingly  popular  alternative  to  AIC,  but  for  criticisms  of  it,  see  articles 
by  Gelman  and  Rubin  and  others  in  the  February  1999  issue  of  Sociological  Methods  and 
Research. Spiegelhalter et al. (2002) proposed a deviance information criterion (DIC) as a
hierarchical  modeling  generalization  of  the  AIC. 
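As a rough illustration of the approximation attributed to Raftery above, the following sketch computes BIC for two fitted models and treats the BIC difference as an approximate 2[log(Bayes factor)]. The log-likelihoods, parameter counts, and sample size are made-up numbers, and the function names are hypothetical; nothing here reproduces an analysis from the text.

```python
import math

def bic(log_lik, n_params, n_obs):
    """BIC = -2 log L + (number of parameters) * log(sample size)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def approx_two_log_bayes_factor(bic_simple, bic_complex):
    """Approximate 2 log(Bayes factor) for the complex model versus the simple one."""
    return bic_simple - bic_complex

# Hypothetical fits to the same n = 500 observations (illustrative numbers only)
bic0 = bic(log_lik=-310.2, n_params=2, n_obs=500)  # simpler model
bic1 = bic(log_lik=-305.9, n_params=4, n_obs=500)  # more complex model
two_log_bf = approx_two_log_bayes_factor(bic0, bic1)

# A negative value means the BIC comparison favors the simpler model here,
# even though the complex model has the larger maximized log-likelihood.
print(round(two_log_bf, 3))
```

With these numbers the gain in fit (8.6 on the deviance scale) is smaller than the penalty 2 log(500) for the two extra parameters, so the simpler model is preferred.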

The  difficulty  of  calculating  the  posterior  distribution  when  the  prior  is  not  conjugate  is 
less problematic with modern ways of approximating posterior distributions by simulating
samples from them. These include the importance sampling generalization of Monte Carlo
simulation  (Zellner  and  Rossi  1984)  and  Markov  chain  Monte  Carlo  methods  such  as 
Gibbs  sampling  (Gelfand  and  Smith  1990).  Zellner  and  Rossi  used  Bayesian  methods  with 
importance  sampling  for  logistic  regression  and  Gelfand  and  Smith  considered  a  class  of 
multinomial  models  with  Dirichlet  prior.  Zeger  and  Karim  (1991)  fitted  generalized  linear 
mixed  models  (GLMMs)  essentially  using  a  Bayesian  framework  with  priors  for  fixed  and 
random  effects. 
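To make the idea of simulating from a non-conjugate posterior concrete, here is a minimal sketch of importance sampling for a single binomial parameter with a normal prior on the logit scale. It is not the Zellner and Rossi procedure; it uses the simplest possible proposal (the prior itself), and the data values, prior standard deviation, and function names are assumptions chosen for illustration. A grid integration is included only as a check.

```python
import math
import random

def posterior_mean_importance(y, n, prior_sd, n_draws=20000, seed=1):
    """Importance-sampling estimate of E(pi | y) for y ~ binomial(n, pi),
    with a normal(0, prior_sd^2) prior on logit(pi) (a non-conjugate prior).
    The prior is the proposal, so each importance weight is the likelihood."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n_draws):
        logit = rng.gauss(0.0, prior_sd)
        pi = 1.0 / (1.0 + math.exp(-logit))
        weight = pi ** y * (1.0 - pi) ** (n - y)  # binomial likelihood kernel
        weighted_sum += weight * pi
        weight_total += weight
    return weighted_sum / weight_total

def posterior_mean_grid(y, n, prior_sd, n_grid=20001, half_width=12.0):
    """The same posterior mean by numerical integration on a logit grid."""
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        logit = -half_width + 2.0 * half_width * i / (n_grid - 1)
        pi = 1.0 / (1.0 + math.exp(-logit))
        dens = math.exp(-0.5 * (logit / prior_sd) ** 2) * pi ** y * (1.0 - pi) ** (n - y)
        num += dens * pi
        den += dens
    return num / den

# Hypothetical data: 7 successes in 10 trials, prior sd 1.5 on the logit scale
estimate = posterior_mean_importance(y=7, n=10, prior_sd=1.5)
check = posterior_mean_grid(y=7, n=10, prior_sd=1.5)
```

Using the prior as the proposal is convenient but can be inefficient when the likelihood is much more concentrated than the prior; a proposal centered near the posterior mode would usually do better.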

The  Bayesian  literature  on  CDA  methodology  has  exploded  in  the  past  25  years  since 
the  introduction  of  MCMC  methods.  For  further  details  and  references,  see  Agresti  and 
Hitchcock  (2005,  also  at  the  text  website),  Congdon  (2005),  Leonard  (1999),  and  Leonard 
and  Hsu  (1994). 


17.6  A  LOOK  FORWARD,  AND  BACKWARD 

Methods  for  categorical  data  analysis  have  developed  in  dramatic  fashion  over  the  past  half 
century.  In  many  ways,  the  area  is  now  a  relatively  mature  one,  and  it  seems  unlikely  that 
the  development  will  be  nearly  as  dramatic  in  the  next  half  century.  However,  it  is  unwise 
to  think  that  this  can  be  predicted  without  considerable  uncertainty. 

As  in  all  branches  of  statistics,  it  does  seem  safe  to  predict  that  in  coming  years  a 
primary  topic  for  development  will  be  methods  for  dealing  with  data  sets  with  very  large 
numbers  of  variables.  With  modeling  methods,  there  is  the  challenge  of  developing  adequate 
model  checking  and  diagnostic  methods.  In  the  Bayesian  context,  there  is  the  challenge 
of  adequate  specification  of  prior  distributions  with  huge  numbers  of  parameters  so  that 
those  priors  are  not  overly  influential  in  the  analysis.  Some  research  on  methods  for  large 
numbers  of  variables  is  largely  outside  the  realm  of  traditional  modeling,  such  as  the 
data  mining  methods  briefly  introduced  in  Chapter  15.  For  these  and  other  complex  data 
structures  and  applications  that  place  a  premium  on  predictive  power,  methodologists  will 
need  to  find  ways  to  overcome  the  sacrifice  of  simplicity  and  interpretability  of  structure. 
Important  areas  of  application  are  likely  to  continue  to  include  genetics,  such  as  the  analysis 
of  discrete  DNA  sequences  in  the  form  of  very  high-dimensional  contingency  tables,  and 
business  applications  such  as  credit  scoring  and  market  basket  analysis  for  predicting  future 
behavior  of  customers. 

As  sources  for  the  historical  tour  in  this  chapter,  I  would  like  to  especially  acknowledge 
Stigler  (1986),  Studies  in  the  History  of  Probability  and  Statistics,  edited  by  E.  S.  Pearson 
and  M.  G.  Kendall  (London:  Griffin,  1970),  and  personal  conversations  over  the  years  with 
many  statisticians,  including  Erling  Andersen,  R.  L.  Anderson,  Henri  Caussinus,  Herman 
Chernoff,  William  Cochran,  Sir  David  Cox,  John  Darroch,  Leo  Goodman,  David  Hoaglin, 
Gary Koch, Frederick Mosteller, John Nelder, Ingram Olkin, C. R. Rao, Donald Rubin,
Stephen  Stigler,  Geoffrey  Watson,  and  Marvin  Zelen. 

To  readers  who  have  made  it  this  far,  I  congratulate  your  perseverance!  To  gain  a  more 
complete  view  of  the  historical  development  of  CDA,  you  may  want  to  read  articles  such  as 
Fienberg  and  Rinaldo  (2007),  Goodman  (2000),  and  Imrey  et  al.  (1981,  1996),  or  browse 
through  some  early  books  on  this  topic,  such  as  R.  L.  Plackett’s  The  Analysis  of  Categorical 
Data  (London:  Griffin,  1974)  and  the  Bishop,  Fienberg,  and  Holland  Discrete  Multivariate 
Analysis (Cambridge, MA: MIT Press, 1975). Finally, you may want to browse the following
chronological list of 28 sources, which convey a sense of how methodology has
evolved.


Pearson (1900)
Yule (1912)
Fisher (1922)
Bartlett (1935)
Berkson (1944)
Neyman (1949)
Cochran (1954)
Goodman and Kruskal (1954)
Roy and Mitra (1956)
Cox (1958a)
Mantel and Haenszel (1959)
Birch (1963)
Caussinus (1966)
Goodman (1968)
Mosteller (1968)
Grizzle et al. (1969)
Goodman (1970)
Haberman (1974a)
Nelder and Wedderburn (1972)
McFadden (1974)
Goodman (1979a)
McCullagh (1980)
Liang and Zeger (1986)
Breslow and Clayton (1993)
Albert and Chib (1993)
Bickel and Levina (2004)
Lang (2004)
Hastie, Tibshirani, and Friedman (2009)


APPENDIX  A 


Statistical  Software  for 
Categorical  Data  Analysis 


In  this  appendix  we  very  briefly  summarize  statistical  software  for  categorical  data  analysis. 
A much more detailed appendix, Using Statistical Software for Categorical Data Analysis,
is at the text website:

www.stat.ufl.edu/~aa/cda/cda.html

That  appendix  presents  details  about  software  use  for  all  of  the  methods  presented  in  this 
text,  with  separate  sections  for  R,  SAS,  SPSS,  and  Stata.  It  also  shows  code  for  R  and  SAS 
for  many  examples  in  this  text.  We  have  placed  it  there  rather  than  in  the  hard  copy  of  the 
book  itself  (1)  because  of  the  rather  long  length  of  this  edition,  (2)  so  it  can  be  updated 
easily  over  time  as  software  capabilities  change,  and  (3)  to  make  it  easier  to  copy  particular 
examples  as  you  conduct  your  own  analyses. 


A.1  SAS

SAS  is  general-purpose  software  for  a  wide  variety  of  statistical  analyses.  The  main 
procedures  (PROCs)  for  categorical  data  analyses  are  FREQ,  GENMOD,  LOGISTIC, 
NLMIXED,  GLIMMIX,  and  CATMOD. 

PROC FREQ computes confidence limits for the binomial proportion including score,
Jeffreys Bayes, Agresti-Coull, and Clopper-Pearson intervals, equivalence and noninferiority
tests for the binomial proportion and the proportion difference, unconditional exact
confidence limits for the proportion (risk) difference, measures of association and their
estimated standard errors, multinomial goodness-of-fit tests, tests of independence in I × J
tables including exact small-sample methods, generalized Cochran-Mantel-Haenszel tests
of conditional independence, and the Zelen exact test for equal odds ratios.
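For readers who want to see what two of the interval types named above compute, the following is a small sketch of the score (Wilson) and Agresti-Coull intervals for a binomial proportion. It is not SAS code and does not reproduce PROC FREQ output; the function names and the example counts (0 successes in 10 trials) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def score_interval(y, n, conf=0.95):
    """Score (Wilson) confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    p_hat = y / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

def agresti_coull_interval(y, n, conf=0.95):
    """Agresti-Coull interval: a Wald-type interval after adding z^2/2
    successes and z^2/2 failures. The limits are not truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    n_tilde = n + z * z
    p_tilde = (y + z * z / 2.0) / n_tilde
    half = z * math.sqrt(p_tilde * (1.0 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

# Hypothetical data: no successes in 10 trials
score_lo, score_hi = score_interval(0, 10)
ac_lo, ac_hi = agresti_coull_interval(0, 10)
```

The two intervals share the same midpoint, and the Agresti-Coull interval is the wider of the two; in this example its lower limit falls below 0, which in practice would be reported as 0.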

PROC  GENMOD  fits  generalized  linear  models  using  ML  or  Bayesian  methods.  It 
also  fits  cumulative  link  models  for  ordinal  responses  and  zero-inflated  Poisson  regression 
models for count data. It can perform GEE analyses for marginal models and gives ML


fitting  of  binary  response  models,  cumulative  link  models  for  ordinal  responses,  and 
baseline-category  logit  models  for  nominal  responses.  It  also  can  perform  conditional 
logistic  regression  and  small-sample  inference  using  the  conditional  likelihood.  PROC 
CATMOD  fits  baseline-category  logit  models.  It  is  also  useful  for  weighted  least-squares 
fitting of a wide variety of models for nonsparse contingency tables. PROC SURVEYLOGISTIC
can fit binary and multiple-category logistic models by the method of pseudo
maximum likelihood, incorporating the sample design into the analysis.

PROC  NLMIXED  fits  generalized  linear  mixed  models  (GLMMs).  It  approximates 
the  likelihood  using  adaptive  Gauss-Hermite  quadrature.  PROC  GLIMMIX  also  fits  such 
models with a variety of fitting methods, including pseudo likelihood methods, and provides
built-in distributions and associated variance functions as well as link functions for
categorical  responses. 

Other  programs  run  on  SAS  that  are  not  specifically  supported  by  the  SAS  Institute. 
For  further  details  about  SAS  for  categorical  data  analyses,  see  the  very  helpful  guide  by 
Stokes  et  al.  (2012). 


A.2  R  AND  S-PLUS 

R  is  free  open-source  software  maintained  and  regularly  updated  by  many  volunteers.  See 
www.r-project.org, at which site you can download it and find various documentation.

Dr.  Laura  Thompson  has  prepared  an  excellent,  detailed  manual  on  the  use  of  S-PLUS 
and  R  to  conduct  the  analyses  shown  in  the  second  edition  of  this  book.  There  is  a  link  to 
it  at  the  text  website  (www.stat.ufl.edu/~aa/cda/cda.html).  There  are  also  links 
there  to  statisticians  who  have  online  material  using  R  for  categorical  data  analyses.  A 
useful  book  on  statistical  modeling  using  R  is  by  Aitkin  et  al.  (2009). 

Some  useful  R  functions  for  categorical  data  analysis  are: 

•  prop.test for a test and score CI for a binomial proportion

•  chisq.test  for  the  chi-squared  test  of  independence 

•  fisher.test  for  Fisher’s  exact  test 

•  glm  for  generalized  linear  models  such  as  logistic  regression,  Poisson  regression,  and 
loglinear  models 
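The calculations behind two of the functions listed above can be sketched in a few lines. The Python code below is only an illustration of the Pearson chi-squared statistic and Fisher's exact test for a 2 × 2 table; it is not R code, the function names are hypothetical, and the counts (3, 1, 1, 3) are an illustrative table rather than output from any example in this appendix. The statistic is computed without a continuity correction, whereas R's chisq.test applies one by default for 2 × 2 tables.

```python
import math

def pearson_chi2(table):
    """Pearson X^2 statistic and df for testing independence in a two-way
    table of counts (no continuity correction)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            x2 += (count - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return x2, df

def fisher_exact_2x2(table):
    """One-sided (greater) and two-sided P-values for Fisher's exact test,
    conditioning on both sets of marginal totals."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):
        # Hypergeometric probability that the first cell equals k
        return math.comb(r1, k) * math.comb(r2, c1 - k) / math.comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    p_greater = sum(prob(k) for k in range(a, hi + 1))
    # Two-sided: total probability of tables no more likely than the observed one
    p_two_sided = sum(prob(k) for k in range(lo, hi + 1)
                      if prob(k) <= p_obs * (1 + 1e-7))
    return p_greater, p_two_sided

table = [[3, 1], [1, 3]]  # illustrative counts
x2, df = pearson_chi2(table)
p_greater, p_two = fisher_exact_2x2(table)
```

For this table the exact one-sided P-value is 17/70 and the two-sided P-value is 34/70, which shows how little evidence such a small sample can provide.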

R  can  do  various  other  analyses  using  specialized  functions  available  in  libraries  or  from 
certain  people.  Examples  are: 

•  Functions  for  forming  score  and  other  confidence  intervals  for  proportions  and 
measures  such  as  the  difference  of  proportions  and  odds  ratio,  at  the  text  website, 
www.stat.ufl.edu/~aa/cda/cda.html.

•  A  function  mph.fit  written  by  Prof.  Joseph  Lang  (joseph-lang@uiowa.edu)  for 
ML  fitting  of  the  generalized  loglinear  model  (10.10),  marginal  models,  and  the  much 
more general “multinomial-Poisson homogeneous” models considered in Lang (2004,
2005). 

•  The VGAM library (www.stat.auckland.ac.nz/~yee/VGAM) and its vglm function
written by Thomas Yee, which can fit a wide variety of models for multinomial
response variables and other types of discrete data.


A.3  STATA

For  examples  of  categorical  data  analyses  for  many  data  sets  in  my  text  An  Introduction 
to Categorical Data Analysis, see the useful site www.ats.ucla.edu/stat/examples/icda
set up by the UCLA Statistical Computing Center. In Stata, the programs:

•  tabulate  can  generate  many  measures  of  association  and  their  standard  errors 

•  glm  can  fit  generalized  linear  models  such  as  logistic  regression  and  loglinear  models 

•  mlogit  can  fit  baseline-category  logit  models  and  ologit  can  fit  ordinal  models 

•  the  GLLAMM  module  (www.gllamm.org)  can  fit  a  very  wide  variety  of  models, 
including  logistic  and  cumulative  logit  models  with  random  effects 

A.4  SPSS 

On  the  Analyze  menu,  the  Descriptive  statistics  option  has  a  Crosstabs  suboption  that 
provides  several  methods  for  contingency  tables,  including  measures  of  association  and 
their  standard  errors.  The  Generalized  linear  models  option  has  a  Generalized  linear  models 
suboption,  the  Regression  option  has  a  Binary  logistic  suboption  and  an  Ordinal  suboption 
for  a  cumulative  link  model  and  Multinomial  logistic  suboption  for  a  baseline-category  logit 
model,  the  Loglinear  option  has  a  General  suboption,  and  the  Generalized  linear  models 
option  has  the  Generalized  estimating  equations  (EE)  suboption.  For  further  details  on  all 
of  the  above,  see  the  text  website. 


A.5  STATXACT AND LOGXACT

The Cytel Software package StatXact (www.cytel.com/Software/StatXact) provides
small-sample confidence intervals for a binomial parameter, the difference of proportions,
relative risk, and odds ratio. It provides Fisher’s exact test and its generalizations for I × J
tables and can conduct exact tests of conditional independence in stratified tables and tests
of equality of odds ratios, and can construct exact confidence intervals for the common
odds ratio in several 2 × 2 tables. Its companion LogXact (www.cytel.com/Software/LogXact)
performs exact conditional logistic regression for categorical responses. Their
manuals are good resources for summaries of small-sample methods.
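As one example of a small-sample interval of the kind mentioned above, the following sketch computes the Clopper-Pearson interval for a binomial parameter by inverting two one-sided exact binomial tests with bisection. It is an assumption-laden illustration, not StatXact's algorithm; the helper names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(Y <= k) for Y ~ binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(0, k + 1))

def _solve(f, target, increasing):
    """Bisection on [0, 1] for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(y, n, conf=0.95):
    """Clopper-Pearson interval: the set of p not rejected by either
    one-sided exact binomial test at level (1 - conf) / 2."""
    alpha = 1.0 - conf
    # Lower limit solves P(Y >= y | p) = alpha/2, which increases in p
    lower = 0.0 if y == 0 else _solve(
        lambda p: 1.0 - binom_cdf(y - 1, n, p), alpha / 2.0, True)
    # Upper limit solves P(Y <= y | p) = alpha/2, which decreases in p
    upper = 1.0 if y == n else _solve(
        lambda p: binom_cdf(y, n, p), alpha / 2.0, False)
    return lower, upper

# Hypothetical data: no successes in 10 trials
cp_lo, cp_hi = clopper_pearson(0, 10)
```

With y = 0 the upper limit has the closed form 1 - (0.025)^(1/10), about 0.31, which is wider than the score interval for the same data; this reflects the conservatism of exact methods for discrete data.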


A.6  OTHER  SOFTWARE 

The  text  website  also  provides  information  about  other  software,  such  as  for  HLM  and 
MLwiN  for  multilevel  models,  LATENT  GOLD  and  LEM  for  latent  class  models,  LIMDEP 
and NLOGIT for multinomial discrete-choice models, SUDAAN for survey data, and
SUPERMIX  for  generalized  linear  mixed  models. 


APPENDIX  B 


Chi-Squared  Distribution  Values 


                      Right-Tailed Probability

 df    0.250    0.100    0.050    0.025    0.010    0.005    0.001

  1     1.32     2.71     3.84     5.02     6.63     7.88    10.83
  2     2.77     4.61     5.99     7.38     9.21    10.60    13.82
  3     4.11     6.25     7.81     9.35    11.34    12.84    16.27
  4     5.39     7.78     9.49    11.14    13.28    14.86    18.47
  5     6.63     9.24    11.07    12.83    15.09    16.75    20.52
  6     7.84    10.64    12.59    14.45    16.81    18.55    22.46
  7     9.04    12.02    14.07    16.01    18.48    20.28    24.32
  8    10.22    13.36    15.51    17.53    20.09    21.96    26.12
  9    11.39    14.68    16.92    19.02    21.67    23.59    27.88
 10    12.55    15.99    18.31    20.48    23.21    25.19    29.59
 11    13.70    17.28    19.68    21.92    24.72    26.76    31.26
 12    14.85    18.55    21.03    23.34    26.22    28.30    32.91
 13    15.98    19.81    22.36    24.74    27.69    29.82    34.53
 14    17.12    21.06    23.68    26.12    29.14    31.32    36.12
 15    18.25    22.31    25.00    27.49    30.58    32.80    37.70
 16    19.37    23.54    26.30    28.85    32.00    34.27    39.25
 17    20.49    24.77    27.59    30.19    33.41    35.72    40.79
 18    21.60    25.99    28.87    31.53    34.81    37.16    42.31
 19    22.72    27.20    30.14    32.85    36.19    38.58    43.82
 20    23.83    28.41    31.41    34.17    37.57    40.00    45.32
 25    29.34    34.38    37.65    40.65    44.31    46.93    52.62
 30    34.80    40.26    43.77    46.98    50.89    53.67    59.70
 40    45.62    51.80    55.76    59.34    63.69    66.77    73.40
 50    56.33    63.17    67.50    71.42    76.15    79.49    86.66
 60    66.98    74.40    79.08    83.30    88.38    91.95    99.61
 70    77.58    85.53    90.53    95.02    100.4    104.2    112.3
 80    88.13    96.58    101.8    106.6    112.3    116.3    124.8
 90    98.65    107.6    113.1    118.1    124.1    128.3    137.2
100    109.1    118.5    124.3    129.6    135.8    140.2    149.5
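Entries of a table like this one can be reproduced numerically. The sketch below is one possible approach, assuming a series expansion of the regularized incomplete gamma function for the right-tail probability and bisection for the critical value; the function names are hypothetical, and the code is meant as a check on tabled values rather than a production routine.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail probability P(X > x) for a chi-squared variable with df
    degrees of freedom, via the series for the lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    # gamma(a, t) = t^a e^(-t) * sum_{n>=0} t^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (a + n)
        total += term
        n += 1
    log_lower = a * math.log(t) - t - math.lgamma(a) + math.log(total)
    return 1.0 - math.exp(log_lower)

def chi2_critical_value(tail_prob, df):
    """Value x with P(X > x) = tail_prob, found by bisection."""
    lo, hi = 0.0, float(df) + 10.0
    while chi2_right_tail(hi, df) > tail_prob:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_right_tail(mid, df) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Compare with tabled entries, e.g. df = 1 and df = 10 at right-tail probability 0.050
value_df1 = chi2_critical_value(0.05, 1)
value_df10 = chi2_critical_value(0.05, 10)
```

For df = 2 the critical value has the closed form -2 log(tail probability), which gives a quick independent check on the routine.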
Young, M., 444
Young, S., 210, 287, 382
Yu, J.-T., 406
Yuan, Y., 282, 520
Yule, G. U., 45, 50, 60, 67, 109, 367, 410, 553, 624,
    634

Zavoina, W., 314, 327, 628
Zeger, S., 443, 462-464, 467, 469, 470, 478, 479, 496,
    497, 525, 527, 544, 628, 634
Zelen, M., 337, 608
Zellner, A., 286, 634
Zelterman, D., 406
Zeng, L., 242
Zermelo, E., 445
Zhang, H., 282, 569, 570, 581
Zhang, T., 28
Zhang, Z., 609
Zhao, H., 527
Zhao, L., 464, 470, 472, 478, 479
Zheng, B., 222, 242, 249
Zhu, J., 279
Zhu, Y., 103
Zocchi, S., 195
Zweifel, J., 70, 195, 401


Example  Index 


Abortion attitudes, 371, 441-442, 483, 500-502
AIDS and AZT use, 184-187
AIDS and government measures, 369
Air pollution and respiratory illness, 484
Alcohol, cigarettes, marijuana, 346-350, 379,
    381-382, 385, 407, 408, 480, 528
Alligator food choice, 294-297, 326, 330
Appendix pain, 108
Assisted living enrollment, 571-574
Auto accidents and seat belts, 41, 61, 350-354,
    371-372, 380

Baseball home team advantage, 437-438, 453
Baseball results, 436-438
Basketball
    Kobe Bryant shooting, 104
    Rajon Rondo assists, 198
    Ray Allen shooting, 158, 528, 559
Beetle mortality, 254, 257
Belief in God by educational level, 77
Belief in heaven, 329
Breast cancer and tamoxifen, 61
Buchanan Presidential votes, 153, 561
Butterfly ballot, 153

Cancer and smoking case-control study, 42
Cancer of larynx by treatment, 107
Cancer remission and labeling index, 196-198, 287
Cannabis use and mother’s age, 324
Carcinoma diagnoses by pathologists, 432-433, 530,
    538-540, 545-546
Cardiovascular disease and teeth brushing, 203
Carp malformation and lead pollution, 107
Child respiratory illness and maternal smoking,
    476-477
Children’s care for mother, 516-519
Cholesterol and psyllium, 334
Clinical trial for fungal infections, 235-236, 529
Coffee purchases, 447
Cola taste tests, 449
Condom use, 202
Coronary death rates, smoking, and age, 157
Credit scoring, 285, 290
Crossover drug comparison, 450
Crossover trial for dysmenorrhea, 480

Death penalty, 201, 372
Death penalty and race, 48, 64, 201
Death rates and Simpson’s paradox, 64
Developmental toxicity study, 312-313
Diarrhea and an antibiotic, 608
Divorce grounds, 583
Draft position and all-star, 204
Driving after consuming alcohol, 203
Dumping severity, 334
Dysmenorrhea crossover trial, 480

Educational aspirations and family income, 106
Endometrial cancer grade, 258, 276
Esophageal cancer and alcohol consumption, 202
Evapotranspiration rates, 474-475

Fish hatching, 560

Gambler’s ruin, 487
Gene-environment interactions, 355
Genomics, 282
Gestation length and infant survival, 407
Global warming attitudes, 63, 203
Golf putting, 103
Government spending, 31, 370, 450, 512
Graduate school admissions
    Berkeley, 63, 245
    Florida, 218, 529
Graham Greene Russian roulette, 29
Gun-related deaths by nation, 62

Halothane study, 630
Happiness and political ideology, 87
Happiness and traumatic events, 304-307, 309, 310
Heart attacks and aspirin use, 37, 46, 63, 71
Heart catheterization and race, 62
Heart disease
    blood pressure, 216, 221
    smoking, 63
    snoring, 118
Heart valve operation survival, 128-130
Heaven and hell, 445, 447
Hepatitis capture-recapture, 530
Home ownership, 198
Homicide victim frequency, 554-555, 557
Homosexual marriage and party ID, 104, 402
Homosexual marriage and religious fundamentalism,
    316-317, 319
Homosexual sex and premarital sex, 65
Horseshoe crab mating, 123-127
    binary modeling, 166-168, 170-175, 187-191, 198,
        208-213, 224
    classification tree, 582
    count modeling, 124-127, 150, 155-156
    discriminant analysis, 567
    generalized additive model, 277
    linear probability model, 155

Infant malformation and alcohol consumption, 176
Insomnia clinical trial, 458-459, 464, 465, 477, 484,
    511-512

Job satisfaction
    by age, 56
    by income, 106, 408
    by race, 65
    predicting, 244
Journal citations, 449

Kyphosis risk factors, 199, 272

Leading crowd membership/attitudes, 508-509, 530
Lung cancer and smoking, 62, 63
Lung cancer clinical trial, 332
Lung cancer survival, 157

Malformation and alcohol consumption, 154
Marginal vs. conditional associations, 49-50, 232,
    379-380
Marital status causal models, 213
Medical diagnoses, 39
Mendel’s theories, 19
Mental depression, 456-458, 463-464, 478, 502
Mental impairment and parents’ SES, 395-398
Menu pricing, 300
Meta-analysis, 246-247, 507, 531
Migration, 425, 530
Missing people, 244
Motif discovery, 283
Movie ratings, 448
Multicenter clinical trial, 225-232, 235-236,
    505-507, 524, 529
Multiple sclerosis evaluations, 448
Murder and race, 61
Murder rates by gender and race, 64
Myers-Briggs personality scales, 369-370
Myocardial infarction and diabetes, 421

NBA basketball predicted probabilities, 196
Netflix prize, 284

Obesity by gender and time, 485
Occupational aspirations, 247
Occupational mobility, 448
Oral contraceptive use, 200
Oxford/Cambridge boat race, 485, 510

Pain after surgery, 59
Party ID and gender, 105
Party ID and race, 105
Penicillin in rabbits, 245
Pig farmer survey, 481
Pneumonia infections in calves, 20-21, 33
Polarized opinions, 402
Political party ID and attitudes, 371
Political party ID, gender and race, 330
Pregnancy rates, 559
Premarital and extramarital sex, 213, 408, 426, 432,
    529
Premarital sex and birth control, 386-387, 389-390,
    392, 409
Presidential voting, 108, 419, 445, 533
    2004 and 2008 elections, 413-417, 492
    Buchanan and butterfly ballot, 153, 561
    election clustering, 532, 579
    election poll, 499-500
    stem cell research, 446
Prime minister evaluation, 486
Prison rates by nation, 62
Promotion discrimination, 269
Prostate cancer diagnostic test, 61
Protozoan and poison dose, 543
Prussian army and mule kicks, 30
Psychiatric diagnosis and prescribed drug, 106

Quality of life, 512

Rap music liking, 332
Regional migration, 425, 430, 530

Satisfaction with housing, 335
Seat belts and injury, 41, 61, 331, 350-354
Sexual intercourse frequency, 560
Sexual orientation and party ID, 602-603
Shopping choice, 321
Silicon wafer imperfections, 155
Simpson’s paradox
    baseball batting averages, 64
    death penalty, 49-50, 64
    death rates, 64
    graduate school admissions, 63, 219, 245
Siskel and Ebert movie ratings, 448
Skin damage and leprosy, 181
Snoring and heart disease, 118
Snowshoe hare capture-recapture, 503-504, 540, 559
Sore throat after surgery, 244
Space shuttle, 199
Stem cell research, 446

Taxes for environment, 371
Tea tasting, 91, 607
Tennis results, 450
Teratology overdispersion, 151-152, 550-552
Titanic survival and gender, 62
Toenail infection clinical trial, 485, 531
Traffic deaths and seat-belt use, 70
Trauma patient survival, 261-263
Travel credit card, 203

Urn sampling in clinical trial, 98

Vegetarianism, 26, 605

World Cup odds, 62


Subject  Index 


Adding constants to cell counts, 15, 32, 70, 101, 401,
    617
Adjacent-categories logit model, 309-311
Adjusted response variable, 147, 161
Agreement, 432-436, 453, 538-540, 545-546
Agresti-Coull confidence interval, 33
AIC, 212-213
    lasso special case, 274
Alternating logistic regressions, 470
Amalgamation paradox, 60
Ancillarity, 616
Arc sine transformation, 618
Association factor, 55, 84
Association measures, 43-47, 54-60, 67-68,
    623-625
Association models, 386-398, 405
    row and column effects, 391-392, 394-395
Asymptotics
    delta method, 72-75, 587-591
    higher-order, 27, 103
Attributable risk, 66
Average causal effect, 191

Backward elimination, 209-211
Bagging, 575
Baseline-category logit model, 293-299
    adjacent category logits, 309-310
    Bayesian fitting, 325-326
    conditional independence, 315-317
    discrete-choice model, 321
    exponential family, 336
    likelihood function, 298
    matched pairs, 424
    matched sets, 442
    random effects, 514-515
    references, 326
    sufficient statistics, 298, 336
Bayesian inference
    binary regression models, 257-265, 286
    CDA history, 633-634
    comparing proportions, 96-99
    equal-tail interval, 23
    GLMs, 142
    highest posterior density interval, 23, 98
    introduction, 22-24
    loglinear models, 401-404
    marginal models, 525
    model averaging, 215
    model checking, 265
    multinomial models, 323-326, 329
    multivariate responses, 523-525
    posterior interval, 23
    two-way tables, 96-100, 103
Bernoulli trials, 5
    correlated, 31, 549-550, 562
Best asymptotically normal (BAN), 610, 629
Beta distribution, 24-25
    prior for binary regression, 260-262
    prior for binomial parameter, 24
    prior for comparing proportions, 97-99
Beta-binomial models, 548-552, 558
    beta-binomial distribution, 31, 548-550
Between-subject effects, 423, 455, 486, 497, 545
Bias of reasonable discrete tests, 112
Bias reduction, 195
    generalized linear models, 195
    logistic regression, 195, 275-276
Bias/variance tradeoff, 270-271
    classification trees, 573, 575
    discriminant analysis, 576
    estimation, 85
    kernel smoothing, 271
    penalized likelihood, 274
BIC, 212-213, 404
Binomial distribution, 5
    Bayesian inference, 24-25, 35
    confidence intervals, 14-16, 32-33, 620
    exponential family form, 115, 132
    inference, 13-17
    likelihood function, 9
    likelihood-ratio test, 13
    moment generating function, 32
    properties, 5-6
    score test, 13
    small-sample inference, 16-17
Binomial GLMs, 117-122
    deviance, 138
    likelihood equations, 134
Biserial correlation, 60
Bonferroni multiple comparisons, 75
    binomial parameters, 101
    discrete adjustment, 101, 281
    multinomial parameters, 34
    multiple testing, 279-280
Bradley-Terry model, 328, 436-439, 445
    quasi-symmetry model, 438
    sufficient statistics, 454
Brandt-Snedecor formula, 178
Breslow-Day test, 242

Calibration, 204
Canonical correlation model, 396, 627
Canonical link function, 114, 133, 147-148
Capture-recapture modeling, 503, 505, 526, 540-541,
    559
Case-control study, 42-43
    logistic regression, 168-169, 195, 628
    matched pairs, 421-423
    odds ratio estimation, 46, 627
Cauchit link function, 264
Chi-squared distribution, 8
    moment generating function, 34
    partitioning, 81-84
    properties, 8, 27
Chi-squared test
    adequacy of approximation, 77, 101, 400-401
    derivation, 22, 596-598
    independence, 75-86, 625-626
    logistic goodness-of-fit, 172-174
    loglinear goodness-of-fit, 348, 359-360
    multinomial goodness-of-fit, 18-22
    noncentral distribution, 239-241, 598
    power, 239-241
    sparse data asymptotics, 406
Classification, 581
    discriminant analysis, 565-570
    high dimensions, 285, 569
    logistic regression, 223-224
    multiple categories, 568-569
    tree-structured, 570-576
Classification table, 223-224, 568
Classification tree, 570-576
    vs. logistic regression, 574-576
Clopper-Pearson confidence interval, 603-605,
    620-621
Cluster analysis, 576-581
Cluster sampling, 513-514
Clustered data, 455-533
    contingency tables, 101
Cochran’s Q, 443, 454, 629
Cochran-Armitage trend test, 90, 178-179, 196, 206
Cochran-Mantel-Haenszel test, 227-229
    generalized, 317-319, 328, 443
    McNemar test connection, 417-418
Cochran, William, contributions to CDA, 629
Collapsibility, 53-54
    difference of proportions, 54
    odds ratio, 53, 60, 67, 232, 379-380
    relative risk, 54
Comparing measures, 620
Comparing models, 207-215
    deviances, 138-139, 248, 383, 410
    Pearson statistic, 140, 383
    sparse data, 400-401
Complementary log-log model, 255-257
    continuation ratios, 327
    ordinal response, 308
Complete separation, 234-237, 275-276
Complete symmetry, 439
Composite likelihood, 462
Concentration coefficient, 68
Concordance index, 224, 314
Concordant pair, 57-59
Conditional independence, 51, 227-229, 374
    binary response tests, 225-230
    graphs, 377-380, 404-405, 536
    loglinear model, 375
    loglinear model test, 349, 400
    marginal independence, 52
    multinomial response tests, 314-319
    ordinal loglinear test, 393
    small-sample tests, 269
    test power, 240-241
Conditional inference
    contingency tables, 90-96, 601-609
    controversy, 94, 102
    logistic regression, 265-270
Conditional logistic regression, 265-270, 286, 607
    binary matched pairs, 418-423
    matched case-control, 421, 423
Conditional logit model, see Discrete-choice model
Conditional ML estimation, 420, 607-609, 628
    between-cluster effect, 423
    odds ratio, 608
    random effects comparison, 493, 526
Conditional symmetry, 444
Confidence intervals
inverting  tests,  12-13,  78 
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profile  likelihood,  79-80 
score-test-based.  14,  78-79 
simultaneous,  75 
small-sample.  603-609 
Confounding.  47,  67 

Conjugate  mixture  models,  552-553,  558 
Conservatism  in  discrete  inference.  17,  93-96.  270, 
603-609 

Constraints  on  parameters.  175-176.  341 
Contingency  coefficient,  1 10.  624 
Contingency  tables,  37-1 12 
confidentiality,  407 
origin.  624 

standardization,  367-368 
Continuation-ratio  logit  model,  31 1-313 
survival  model,  327 
Continuity  correction.  27.  102 
Continuous  proportions,  162 
Correlation 
exchangeable,  462 
models,  395-396.406,412 
predictive  power,  221 
test,  87,318 

working,  clustered  data,  462 
Correspondence  analysis,  396-398,  406,  627 
Cramer’s  V.  1 10 
Credible  interval,  23 
Credit  scoring.  285 

Cross-classification  table,  see  Contingency  tables 
Cross-product  ratio,  see  Odds  ratio 
Cross-validation,  223,  249,  273,  567, 568 
Mold,  274 

Cumulant  function,  132 
Cumulative  link  model.  308-309,  338 
Bayesian  fitting.  323 
dispersion  effects.  328 
Cumulative  logit  model.  301-307.  628 
Bayesian  fitting,  323 
conditional  independence,  315-317 
matched  pairs.  425-426 
matched  sets,  442 
random  effects,  511-514 
references.  327 
sample  size  and  power.  327 
Cumulative  odds  ratio.  302 
uniform  association,  337 
Cumulative  probit  model,  308.  309.  628 
Bayesian  fitting.  323-324 
references.  327 

Data  mining.  570 
Decomposable  model.  359,  368 
existence  of  ML  estimates.  399 
graphical  models  contain.  378 
Degrees  of  freedom 

effect  of  estimating  parameters,  20.  76.  625 
moments  of  chi-squared.  8 


Delta  method,  72-75,  587-591, 615 
Delta,  two  ordinal  distributions,  58 
Dendrogram,  578
Dependent  proportions
clustered  data,  455
increased  precision,  416
matched  pairs,  414-454
Deviance,  116,  137
binomial  GLMs,  138,  139
comparing  models,  139,  187
goodness  of  fit,  136-137 

grouped  vs.  ungrouped  binary  data,  138,  139,  155,  172,  205

information  criterion  (DIC),  265
Poisson  GLM,  137,  139
residual,  141,  216
Dfbeta,  220 

Difference  of  proportions 
Bayesian  inference,  97-99 
chi-squared  test,  78 
confidence  interval,  71,  101 
matched  pairs,  414-417
score  confidence  interval,  78-79,  415
score  test,  78-79,  415
small-sample  confidence  interval,  609
small-sample  test,  93-96,  416-417
standard  error,  71,  414
Differential  item  functioning,  242 
Dirichlet  distribution,  25,  28 
Dirichlet-multinomial  distribution,  558,  563 
Discordant  pair,  57-59 
Discrete-choice  model,  320-323,  628
ordered  categories,  322-323
Discreteness

complications,  17,  93-96,  270,  603-609
Discriminant  analysis,  565-570,  581
diagonal,  568,  569
quadratic,  204,  568
vs.  logistic  regression,  569-570,  581
Dispersion  effects,  328
Dispersion  parameter,  127,  131,  553
Dissimilarity  index,  352-353
Dissimilarity,  clustering,  577-578
Diversity,  68,  618 

Ecological  diversity,  618
Ecological  inference,  526
Effect  modifier,  53
EM  algorithm,  520-521,  537-538
Empirical  Bayes,  100,  112
Empirical  logit,  165,  195
Entropy,  56

Equiprobability,  340,  359
Exact  inference

conditional,  90,  270,  601-609,  616
conditional  vs.  unconditional,  94,  102
logistic  regression,  267-270,  607-609



testing  conditional  independence,  269 
testing  independence,  90-96,  601-603,  616,  620
unconditional,  93-96,  102,  609
Exponential  dispersion  family,  130-132 
multivariate,  336 
Exponential  family,  114,  131 
Extreme  value  distribution 
cdf  for  log-log  link,  256 
latent  for  complementary  log-log,  308 
utility  and  discrete  choice,  322 
utility  and  multinomial  logit,  299,  322,  323 
utility  for  logistic  model,  286 

False  discovery  rate,  280-281,  287 
Firth  penalized  likelihood,  275-276 
Fisher  scoring,  144-148,  627
Fisher's  exact  test,  90-96,  102,  268,  626 
Bayesian  comparison,  97 
Fisher,  R.  A.,  contributions  to  CDA,  625-627 
Freeman-Tukey  statistic,  34,  615 
Fuzzy  inference,  616

Gambler’s  ruin  problem,  487
Gamma  (ordinal  measure),  57-59,  579
inference,  88,  618-619
Yule’s  Q,  67,  109
Gauss-Hermite  quadrature,  520 
GEE,  see  Generalized  estimating  equations 
GEE2,  479 

Generalized  additive  model,  124,  165,  276-277
multinomial  response,  287 
Generalized  CMH  statistic,  328,  443 
Generalized  estimating  equations,  462-473,  486 
GEE2,  479 

working  correlations,  462 
Generalized  linear  mixed  models,  490-533 
Bayesian,  523-525 

Generalized  linear  models,  113-162,  628
binary  data,  117-122,  163-265
count  data,  122-130,  552-557
covariance  matrix,  135-136
likelihood  equations,  133-135,  145,  148
link  function,  114,  133
multivariate,  299

quasi-likelihood,  149-152,  468-470
random  component,  114
sufficient  statistics,  148
systematic  component,  114,  132
Generalized  loglinear  model,  393-394 
marginal  models,  460 
Genomic  applications,  282-284 
Geometric  distribution,  621 
Gibbs  sampling 

cumulative  probit  model,  324 
multinomial  probit  model,  325 


GLMM,  see  Generalized  linear  mixed  models 
Global  odds  ratio,  393,  406,  478
Goodman  and  Kruskal  lambda,  68
Goodman  and  Kruskal  tau,  68,  110
Goodman,  Leo,  contributions  to  CDA,  630
Goodness  of  fit
deviance,  136-137
Hosmer-Lemeshow  test,  173
likelihood-ratio  statistic,  18,  19
likelihood-ratio  test,  186,  597
logistic  model,  172-174
loglinear  model,  348,  359-360
Pearson  statistic,  18-22,  596
Graphical  models,  378-379,  404-405
Grouped  data 

deviance  different  from  ungrouped,  172,  205
grouped  vs.  ungrouped  binary  data,  138,  139,  159,
205,  216,  220,  223

Hat  matrix,  141,  220,  596
Hazard  function,  160,  327
Hessian  matrix,  143 
Heterogeneity,  126 

multicenter  clinical  trials,  505-507 
Hierarchical  Bayes,  100,  402,  525,  633
Hierarchical  models,  515-519
High-dimensional  data,  278-285,  569,  579
Highest  posterior  density  intervals,  23,  98-99
Homogeneity,  40
odds  ratios,  232-233,  242
Homogeneous  association,  53
linear-by-linear,  392,  411
loglinear  model,  344-348,  350,  375
small-sample  test,  607-608
symmetric  property,  67
Hosmer-Lemeshow  test,  173
Hypergeometric  distribution,  91
mean  and  variance,  227
multivariate,  268,  328,  601-603
noncentral,  268,  606 

Incomplete  table,  398 
Independence,  40 
2x2  tables,  78 
Bayesian  testing,  99 
chi-squared  tests,  75-86,  625-626 
conditional,  51,  343-344 
joint,  343 

loglinear  model,  339-340,  343-344,  357,  374 
marginal  homogeneity,  453 
marginal  vs.  conditional,  51,  344
mutual,  343

plus  marginal  homogeneity,  109
score  statistic,  76,  411

Independence  from  irrelevant  alternatives,  320,  322

Indicator  variables,  175-176 




Infinite  estimates 

finite  with  penalized  likelihood,  275 
logistic  regression,  233-237,  242,  249,  275 
loglinear  models,  399-402
Influence  diagnostics,  220-221,  241
Information  matrix,  9-11,  24,  135
observed  vs.  expected,  145,  148,  154,  193 
Interaction 

loglinear  three-factor,  346,  350,  352 
loglinear  two-factor,  341
no  three-factor,  344 
odds  ratio  definition,  53 
Interval  variable,  2 
Intraclass  correlation,  444 
Isotropic  table,  410 
Item  response  models,  492-493,  627
Iterative  proportional  fitting,  365-368
Iterative  reweighted  least  squares,  147,  153,  195,  365

Jaccard  index,  577 
Jeffreys  prior  distribution,  24 
binary  regression,  258 
binomial  parameter,  24,  28 
comparing  proportions,  97 
Firth  penalized  likelihood,  275 
multinomial  parameters,  25 
Joint  independence,  374 
Joint  response  models,  526 

Kappa  agreement  measure,  434-436,  444,  453 
Kendall’s  tau,  58 
Kernel  smoothing,  271-273,  287 
multinomial  data,  291 
Kullback-Leibler  distance,  212,  614 

Lagrange  multiplier  test,  see  Score  test 
Laplace  approximation,  521 
Large-sample  distribution  theory,  10-12,  592-600 
likelihood-ratio  statistic,  597-598 
logit/loglinear  models,  599-600 
model  parameter  estimates,  592-593 
non-normal,  619 
Pearson  statistic,  22,  596 
probability  estimates,  593-595 
residuals,  595-596 
Lasso,  274-275,  279,  287 
Latent  class  models,  535-541,  561 
Latent  variable,  122 

Bayesian  modeling,  263,  323 
hierarchical  model,  516 
latent  class  models,  535-541,  557,  561 
multinomial  models,  299,  303,  308-309 
probit  model,  252 

proportional  odds  structure,  303-304 
LD50,  165,  195,  627
Leverage,  141

Likelihood-ratio  confidence  interval,  12
Likelihood-ratio  test,  11
comparing  deviances,  187 
comparing  models,  138-139 
independence  in  two-way  table,  76 
theoretical  justification,  597-598 
Linear  logit  model,  177-182,  217 
efficiency  and  scoring,  196
log  likelihood,  206 
score  test,  179,  206 

small-sample  conditional  inference,  268 
Linear  probability  model,  117-118,  178,  566
likelihood  equations,  290 
Linear  trend 

two-way  table,  86-90,  178,  387,  629 
Linear-by-linear  association  model,  387-391,  405 
heterogeneous,  393,  411
homogeneous,  392,  400,  411
isotropy,  410
Link  function,  114

t  inverse  cdf,  160,  252,  264
canonical,  114,  133,  147-148 
complementary  log-log,  255,  308 
generalized,  286,  327 
identity,  114,  117,  121,  128,  419
inverse  cdf,  121,  308 
log,  115 
log-log,  256 
logit,  115,  120,  627
probit,  121,  308 
Local  odds  ratio,  54,  336 
large-sample  distribution,  619 
uniform  association,  338,  387 
Logistic  distribution,  121 

t  distribution  approximation,  122,  264,  324
Logistic  regression,  119-122,  163-249 
2x2  tables,  120 

adjacent-categories  logits,  309-311
autoregressive  structure,  475-477,  509
baseline-category  logit,  293-299
Bayesian  fitting,  257-265,  286
case-control  studies,  168-169,  282,  421-423,  628

categorical  predictors,  175-189 
collapsibility,  232,  379 

conditional,  265-270,  286,  418-423,  607-609
covariance  matrix,  193-194 
design,  195,  286 
diagnostics,  215-224 
extreme-value  utility,  286 
goodness  of  fit,  171-174 
history,  627-628 
imbalance  of  outcomes,  208,  242 
implied  by  normal  explanatory  variables,  169,  204, 
566 

infinite  estimates,  233-237,  275-276 
likelihood  equations,  193,  206 
linear  logit  model,  177-182
loglinear  model  connection,  353-356
marginal  model,  418-419,  424-426,  440-442,  457-471

model  fitting,  192-195 
model  selection,  207-215 
nonparametric  random  effects,  542-543 
parameter  interpretation,  163-169 
random  effects  and  marginal  model  both  logistic, 
496, 526 

random  effects  models,  491-533 
retrospective  studies,  168-169,  195,  421-423

small-sample  inference,  267-270,  607-609 
subject-specific,  419,  489-533 
Logistic-normal  model,  494-511
Logit,  25,  73,  627 
bias,  195 

confidence  interval,  109 
history,  627 
standard  error,  73 

Logit  models,  see  Logistic  regression 
Logit-normal  distribution,  25,  162 
Loglinear  model,  115
Bayesian  fitting,  401-404 
collapsibility,  379-380 
complex  sampling  designs,  363-364 
conditional  independence,  343,  375 
count  response  data,  123-130 
covariance  matrix,  360 
fitting,  356-368 
four-way  tables,  350-353,  375 
generalized,  393-394 
goodness-of-fit  test,  348,  359-360 
hierarchical,  341
history,  629-632 
homogeneous  association,  350 
independence  in  two-way  table,  130,  339-340,  357, 
372-374 

inference,  348-350 
infinite  estimates,  399-400
joint  independence,  343 
large-sample  theory,  599-600 
likelihood  equations,  356-359 
logistic  model  connection,  353-356 
model  selection,  380-385 
multinomial,  342,  361-362 
mutual  independence,  343,  374,  410 
no  three-factor  interaction,  344,  350 
parameter  constraints,  341 
probability  estimates,  362 
random  effects,  563 
saturated,  340-341,  345,  356 
three-way  tables,  342-350 
Loglinear  models 

large-sample  theory,  619 
Lowess,  277 


Machine  learning,  570,  581 
Mantel-Haenszel  effect  estimates,  229-232 
Mantel,  Nathan,  contributions  to  CDA,  628 
Marginal  homogeneity 
T-way  tables,  440-443
binary  data,  414-418
implied  by  symmetry,  427
independence,  453
matched  sets,  439-443
quasi-symmetry  connection,  429-430,  440
Marginal  likelihood,  519
Marginal  models,  418

approximate  relation  with  random  effects  models, 
496-497 

binary  matched  pairs,  452 
GEE  fitting,  462-473,  478-479
ML  fitting,  456-462,  478
multiway  tables,  440-443
nominal  matched  pairs,  424-425
ordinal  matched  pairs,  425-426
random  effects  models  comparison,  495-498
square  tables,  418,  424-426
vs.  transitional  models,  for  matched  pairs,  477 
Marginal  symmetry,  440 

Marginal  vs.  conditional  associations,  52,  344,  409 
Marginally-specified  model,  527 
Market  basket  data,  576 
Markov  chain  Monte  Carlo,  23,  257,  524 
Markov  chains,  473-477,  479,  487
Matched  pairs,  413-454
subject-specific  model,  418-423
Bayesian  inference,  523
bivariate  binary  response,  508-509
marginal  model,  418
McNemar  test,  415
random  effects  model,  491-492
McNemar's  test,  415-417,  424,  427
Cochran-Mantel-Haenszel  test  connection,  417-418

crossover  study,  446
generalized,  443
paired  t  test,  451
Measurement  error,  368 
Median  effective  level,  165 
Meta-analysis,  53,  230-233,  507,  526 
Bayesian,  523 
Mid  P-value,  17,  33,  103
confidence  intervals,  605 
Fisher’s  exact  test,  93 
Mid  distribution  function,  33 
Midranks,  89 
Minimax  estimate,  28 

Minimum  chi-squared  estimation,  613-616,  622 

Minimum  discrimination  information,  34,  614-616 

Misclassification  error,  242,  368 

Missing  at  random,  471

Missing  completely  at  random,  471
Missing  data,  471,  479
clustered  data,  471-473,  486-487
two-way  tables,  102 
Mixed  logit  model,  322 
Mixed-membership  model,  557 
Mixture  models,  535-564 
beta-binomial,  548-552 
latent  class,  535-541 
logistic-normal,  494-511,  562
negative  binomial,  553,  563 
nonparametric  random  effects,  548 
Rasch  mixture,  545 
Model  misspecification 
GEE  methods,  463,  479 
Model  selection,  278-281 
logistic  models,  207-215 
loglinear  models,  380-385,  398,  405 
Model  smoothing,  182,  270-278,  594 
Monte  Carlo  methods,  23,  257,  520-521, 524,  602 
Mosaic  plot,  81
Multicollinearity,  208-209 
Multilevel  models,  515-519 
Multinomial  distribution,  6 

Bayesian  inference,  25-28,  323-326 
correlation  structure,  6,  32,  590,  618 
inference,  17-22 
likelihood  function,  17 
likelihood-ratio  statistic,  18,  19,  34 
multiple  comparisons,  28,  34 
Pearson  goodness-of-fit  statistic,  18-22
Poisson  connection,  7,  361-362 
properties,  6 

Multinomial  logit  models,  293-338 
Multinomial  Poisson  homogenous  model,  394 
Multinomial  probit  model,  299-300 
Bayesian  fitting,  325 
discrete  choice,  321-322 
Multinomial  sampling,  40 
independent,  41,  67
product,  41

Multiple  comparisons,  75

Bonferroni  method,  75,  279-280
false  discovery  rate,  280-281
loglinear  models,  384 
multinomial  parameters,  28,  34 
odds  ratios,  75 
proportions,  101 

Multiple  correspondence  analysis,  398 
Multiple  imputation,  472 

Multivariate  hypergeometric  distribution,  601-603, 
620 

Mutual  independence,  374,  375,  410 

Natural  exponential  family,  152 
Nearest  neighbors 
classification,  569 
smoothing,  272-273,  284,  287 


Negative  binomial  distribution,  127,  558 
exponential  family,  161 
mode,  553 

no.  successes  before  k  failures,  32 
Poisson  connection,  32,  127,  553,  563 
variance  proportional  to  mean,  563 
Negative  binomial  GLMs,  127,  552-557,  560-561 
Nested  logit  model,  322 
Newton-Raphson  method,  143-148 
logistic  regression,  194-195 
loglinear  model,  364-367 
Neyman  modified  chi-squared,  34,  613 
No  three-factor  interaction,  375 
Nominal  variable,  2

conditional  independence  tests,  315-319 
modeling,  293-300,  320-322 
Noncentral  chi-squared  distribution,  180-181,  243,
598-599,  615
noncentrality,  239-241 

O,  o  rates  of  convergence,  588 
Observational  study,  43 
Occupational  mobility,  444 
Odds  ratio,  44-47 
I  x  J  tables,  54
asymptotic  distribution,  69,  591 
Bayesian  inference,  97-99 
confidence  interval,  70,  79,  606 
global,  393, 406,  478 
history,  624 

homogeneity,  232-233,  242,  627 
local,  54,  336,  387 
logistic  regression,  164 
Mantel-Haenszel  estimate,  229-232 
properties,  45-47,  60,  66
relative  risk  approximation,  47,  66 
small-sample  confidence  interval,  606-607 
standard  error  of  log,  70,  75 
working  association,  GEE,  470 
Yule’s  Q  connection,  67 
Offset,  128 

Ordered  logit  model,  301-307,  309-313 
Ordered  probit  model,  308-309 
Ordinal  data 

adjacent-categories  logit  model,  309-311
comparing  two  distributions,  58-59,  68,  90
concordant  and  discordant  pairs,  57
conditional  independence  tests,  315-319,  393
continuation-ratio  logit  model,  311-313
cumulative  link  models,  308-309 
cumulative  logit  models,  301-307,  309 
cumulative  probit  model,  308-309 
discrete-choice  models,  322-323 
independence  test,  86-90 
loglinear  models,  386-393 
power  advantage,  88,  179-181 
trend  test,  178-182,  316,  391
Ordinal  quasi-symmetry,  431-432,  444
Ordinal  response,  526 
Ordinal  variable,  2 
Ordinary  least  squares 

linear  discriminant  analysis,  566 
linear  probability  model,  1 18 
ordinal  response,  327 
Outlier,  220 

Overdispersion,  7,  126-127 
beta-binomial  models,  548-552 
binomial  GLMs,  150-152,  562 
impossible  with  Bernoulli,  562 
multinomial  GLM,  313 
Poisson  GLM,  126-127,  149-150,  552 

P-value 

mid,  see  Mid  P-value
randomized,  28 
two-sided,  92 

Parameter  constraints,  175-176,  341 
Parsimony 

estimating  proportions,  85-86 
model  selection,  212
model  smoothing,  182,  270,  594 
Partial  tables,  48 
Partitioning  chi-squared,  81-84 
combining  categories,  110 
comparing  measures,  620 
loglinear  models,  384, 405 
trend  test  in  I  x  2  table,  178
Pattern  mixture  models,  472
Pearson  chi-squared  statistic,  18,  623
I  x  2  table,  178
2x2  tables,  78
comparing  models,  140
goodness  of  fit,  18-22,  596
grouped  vs  ungrouped  binary  data,  159 
independence  test,  75 
moments,  76,  101 
theoretical  justification,  22,  596 
Pearson  residual,  80,  385-386 
binary  regression,  215 
binomial,  110
GLM,  140

Poisson  GLM,  141,  595
Pearson,  Karl,  contributions  to  CDA,  623-626 
Pearson-Yule  association  controversy, 
623-625 

Penalized  likelihood,  273-276,  287
Penalized  quasi-likelihood,  521-522
Perfect  discrimination,  234
Perfect  table,  405
Phi-squared,  110

Plus  four  confidence  interval,  33,  101 
Poisson  distribution,  6 
exponential  family  form,  115,  132 
moment  generating  function,  32 
multinomial  connection,  7,  361-362 


negative  binomial  connection,  127 
overdispersion,  7,  149,  552 
properties,  6-7 
variance  test,  161 
Poisson  GLM,  115,  123-130,  152
common  mean  model,  486
deviance,  137

overdispersion,  127,  149-150,  552-557
Pearson  residual,  141,  595
random  effects  models,  555-557,  563-564
standardized  residuals,  141,  596
Poisson  loglinear  model,  115,  123-130,  136
contingency  tables,  339-412
covariance  matrix,  136 
likelihood  equations,  136,  357 
multinomial  connection,  600 
Polychoric  correlation,  59 
Population-averaged  effect,  418,  419,  493, 
495-498 

Positive  likelihood-ratio  dependence,  410 
Positive  predictive  value,  66 
Posterior  interval,  23 
Power 

chi-squared  tests,  239-241 
comparing  proportions,  237-238 
Power  divergence  statistic,  34 
Predictive  power 

binary  regression,  221-224,  242 
linear  discriminant  analysis,  574 
ordinal  models,  314 
Prior  distribution 
beta,  24 

binary  response  probabilities,  24-25,  260 
comparing  proportions,  96 
conjugate,  24,  552 
data  augmentation,  261
Dirichlet,  25,  28
improper,  27,  257
multivariate  normal,  257
Probit  link  function,  121,  252,  323,  626
Probit  model,  252-255,  285 
Bayesian  fitting,  263-265,  323 
history,  626 

interpreting  effects,  252-253 
likelihood  equations,  290 
multinomial,  299-300 
ordered,  308-309 
threshold  model,  252,  253 
utility  functions,  252 

Profile  likelihood  confidence  interval,  80,  615 
capture-recapture,  504 
difference  of  proportions,  102 
odds  ratio,  80,  170,  179,  236,  400 
software,  80 
Propensity  score,  233 
Proportional  hazards  model,  308,  327 
Proportional  odds  model 

adjacent-categories  logit,  310
cumulative  logit,  301-304
testing  fit,  306-307 

Proportional  reduction  in  variation,  56,  221-223 
deviance,  223 
Proportions 

confidence  intervals,  14-16,  78-79,  603-605 
continuous,  162 
difference,  44,  78-79,  414,  609 
ratio,  see  Relative  risk 

Quasi  variances,  196 
Quasi-complete  separation,  234 
Quasi-independence,  429-430,  444
agreement  modeling,  433-434
Quasi-likelihood  methods,  149-153
binomial  overdispersion,  150-152,  549-552,  558
clustered  data,  462,  465-470,  479
Poisson  overdispersion,  149-150
Quasi-symmetry  model,  630
agreement  modeling,  434-444
Bradley-Terry  model,  438 
collapsing,  454 

marginal  homogeneity  test,  429-430,  440
matched  sets,  440 

nonparametric  logistic  connection,  546-548 

references,  444 

square  tables,  427-432,  453

R  (software),  638,  text  website 
R-squared  measures,  221-223,  242,  314 
Raking  contingency  table,  367-369 
Random  effects  models,  489-533 
autocorrelated  random  effects,  509-511
binary  data,  498-511
binary  matched  pairs,  421
count  data,  555-557
interpretations,  494
marginal  model  comparison,  495-498
misspecification,  543-545
multilevel,  515-519
multinomial,  511-515,  527
non-normal  random  effect,  526 
nonnegative  marginal  correlations,  494 
nonparametric,  542-548 
parameterizations,  507-508 
predicted  random  effects,  522 
probit  link,  532-533 
Random  forest,  575 
Random  intercept  model,  491 
Randomized  test,  28,  93,  616 
Ranking  outcome  categories,  89,  323 
Rasch  mixture  model,  545-546,  558 
Rasch  model,  492-493,  525,  527,  533,  627
Rate  data,  128-130,  152
Rater  agreement,  432-436,  538-540,  545-546
RC  model,  394-395,  405-406
Bayesian,  406
isotropy,  410


Regressive  logistic  model,  476 
Regularization  methods,  274 
Relative  risk,  44 

attributable  risk  connection,  66 
Bayesian  inference,  112
confidence  interval,  71,  79,  609
odds  ratio  approximation,  47
standard  error,  71
Residuals

contingency  table,  80-81
deviance,  141,  216
GLMs,  140-142 
loglinear  models,  385-386 
Pearson,  80,  140,  215,  385-386,  595 
references,  153 

standardized,  81,  141, 216-219,  385-386,  596 
Retrospective  studies,  42,  168-169 
logistic  regression,  195 
Ridge  regression,  274 
Ridits,  111
ROC  curve,  224 
ordinal  response,  327 

Row  effects  loglinear  model,  391-392,  405,  410 
isotropy,  410

S-PLUS,  638 
Sample  proportion 
admissible  estimator,  28 
binomial  parameter  inference,  13-16 
minimax  estimate,  28 
ML  estimate,  10 
Sample  size 

comparing  proportions,  237-238,  242 
logistic  regression,  238-239,  242 
power,  237-241
Sampling  zero,  398-401
Sandwich  covariance  matrix,  467-468,  471
SAS,  637-638,  text  website 
Saturated  model,  116,  136 

loglinear,  340-341,  345,  355,  373 
Score  confidence  interval 

difference  of  proportions,  78-79,  609 
odds  ratio,  79,  607 
proportion,  14,  604,  621 
references,  27,  102 
relative  risk,  79 
Score  test,  11

binary  regression,  179 
CMH  test,  227 
comparing  GLMs,  140 
difference  of  proportions,  78 
generalized  CMH  test,  319 
goodness  of  fit  of  GLM,  140 
linear  logit  model,  179 
multinomial  models,  319 
Pearson  chi-squared  statistic,  76,  140 
proportion,  13 

proportional  odds  and  Wilcoxon,  327
Scores

choice  of,  88-90,  179 
Selection  models,  472 
Sensitivity,  39-40,  66,  223-224
Sequential  logit  model,  see  Continuation-ratio  logit 
model 

Simpson’s  paradox,  50,  60,  64,  219,  374,  624 
Simultaneous  testing,  248 
Small-area  estimation,  498-500 
Small-sample  distribution  theory,  601-609 
Smoothing,  287,  633 
binary  data,  270-278 
generalized  additive  model,  276-277 
kernel,  271-273,  287
model,  594 

penalized  likelihood,  273-276,  287 
Software 

examples,  www.stat.ufl.edu/~aa/cda/cda.html,  637
website  for  book,  4
Sparse  contingency  table,  77,  398-401
asymptotics,  229-230,  400-401,  406,  615
CMH  test,  228-229 
smoothing,  273,  287 
Spatial  data,  526 
Specificity,  39-40,  66,  223-224 
Spline  function,  276 
SPSS,  639,  text  website 
Square  tables,  413-454
Standardized  regression  coefficients,  196
Standardized  residuals,  220,  596
2x2  tables,  81,  218-219
binomial,  110,  216-219
GLM,  141
independence,  81
loglinear  models,  385-386 
score  statistic  for  outlier,  153 
Stata,  639,  text  website 
Statistical  vs.  practical  significance,  352-353 
StatXact,  639 

Stepwise  procedures,  209-211,  281
Stereotype  model,  405 
Stochastic  ordering 

Bayesian  evaluation,  103 
Bayesian  probability  estimate,  329 
discrete  cdf  and  uniform,  27 
ordinal  response,  313,  410 
two  cdfs,  58 
Stochastic  process,  473 
Structural  zero,  21,  398 
Subject-specific  effect 

binary  matched  pairs,  418-420
binary  matched  set,  493
matched  set,  423
random  effects  models,  495-498
Subject-specific  table,  418
Summarizing  measures,  620 
Supervised  vs.  unsupervised  learning,  577 
Support  vector  machines,  569,  575,  581 


Suppressor  variable,  67,  279 
Survival  model,  128-130,  152,  159 
Symmetry 

binary  data,  414 
logistic/loglinear  models,  427 
matched  sets,  439 
square  tables,  426 

t  distribution  approximation  of  logistic,  122,  264,  324 
Tetrachoric  correlation,  59,  624 
Threshold  model,  122 

nonconstant  variability,  290 
ordinal  response,  303 
Time  series,  473-476,  479,  509-511
Tolerance  distribution,  122,  252
Transition  probability  matrix,  473
Transitional  models,  473-479,  541
Tree-structured  classification,  570-576,  581 
Trend  tests,  87,  178-182,  196,  316,  391 

Uncertainty  coefficient,  56 
and  G²,  111

Uniform  association  model 
cumulative  odds  ratio,  337 
global  odds  ratio,  393 
local  odds  ratio,  338,  387 
Uniform  interaction  model,  411
Unsupervised  vs.  supervised  learning,  577 
Upper-triangular  table,  444 
Utility 

logistic  model,  286 

multinomial  probit  model,  299,  321,  325
probit  model,  252 

Variance  stabilizing  transformations,  117,  618,  629

Wald  statistic,  10

aberrant  behavior  for  logistic  regression,  174
confidence  interval,  12

dependence  on  parameterization,  27,  78,  174,  205
infinite  estimate,  235 
Weight  matrix,  147 
Weighted  kappa,  435,  444 

Weighted  least  squares,  146-147,  478,  610-614,  616, 
629 

Wilcoxon  test,  90 

cumulative  logit  model,  327 
Within-subject  effects,  416,  423,  495,  497,  526 

Yule’s  Q,  67,  624-625
asymptotic  variance,  109 
Yule,  G.  U.,  contributions  to  CDA,  624-625 

Zero  count 

effect,  15,  230-232,  236,  613
infinite  estimates,  73,  101,  235
odds  ratio,  101
sampling  zero,  398-401
structural,  21

Zero-inflated  models,  558 


WILEY  SERIES  IN  PROBABILITY  AND  STATISTICS 
ESTABLISHED  BY  WALTER  A.  SHEWHART  AND  SAMUEL  S.  WILKS

Editors:  David  J.  Balding,  Noel  A.  C.  Cressie,  Garrett  M.  Fitzmaurice,

Harvey  Goldstein,  Iain  M.  Johnstone,  Geert  Molenberghs,  David  W.  Scott,

Adrian  F.  M.  Smith,  Ruey  S.  Tsay,  Sanford  Weisberg

Editors  Emeriti:  Vic  Barnett,  J.  Stuart  Hunter,  Joseph  B.  Kadane,  Jozef  L.  Teugels


The  Wiley  Series  in  Probability  and  Statistics  is  well  established  and  authoritative.  It  covers 
many  topics  of  current  research  interest  in  both  pure  and  applied  statistics  and  probability  theory. 
Written  by  leading  statisticians  and  institutions,  the  titles  span  both  state-of-the-art  developments 
in  the  field  and  classical  methods. 

Reflecting  the  wide  range  of  current  research  in  statistics,  the  series  encompasses  applied, 
methodological  and  theoretical  statistics,  ranging  from  applications  and  new  techniques  made 
possible  by  advances  in  computerized  practice  to  rigorous  treatment  of  theoretical  approaches. 

This  series  provides  essential  and  invaluable  reading  for  all  statisticians,  whether  in  academia,
industry,  government,  or  research.

†  ABRAHAM  and  LEDOLTER  •  Statistical  Methods  for  Forecasting
AGRESTI  •  Analysis  of  Ordinal  Categorical  Data,  Second  Edition 
AGRESTI  •  An  Introduction  to  Categorical  Data  Analysis,  Second  Edition 
AGRESTI  •  Categorical  Data  Analysis,  Third  Edition 

ALTMAN,  GILL,  and  McDONALD  •  Numerical  Issues  in  Statistical  Computing  for  the 
Social  Scientist 

AMARATUNGA  and  CABRERA  •  Exploration  and  Analysis  of  DNA  Microarray  and 
Protein  Array  Data 
ANDEL  •  Mathematics  of  Chance 

ANDERSON  •  An  Introduction  to  Multivariate  Statistical  Analysis,  Third  Edition 

*  ANDERSON  •  The  Statistical  Analysis  of  Time  Series 

ANDERSON,  AUQUIER,  HAUCK,  OAKES,  VANDAELE,  and  WEISBERG  •  Statistical 
Methods  for  Comparative  Studies 
ANDERSON  and  LOYNES  •  The  Teaching  of  Practical  Statistics 
ARMITAGE  and  DAVID  (editors)  •  Advances  in  Biometry 
ARNOLD,  BALAKRISHNAN,  and  NAGARAJA  •  Records 

*  ARTHANARI  and  DODGE  •  Mathematical  Programming  in  Statistics 

*  BAILEY  •  The  Elements  of  Stochastic  Processes  with  Applications  to  the  Natural  Sciences 
BAJORSKI  •  Statistics  for  Imaging,  Optics,  and  Photonics 

BALAKRISHNAN  and  KOUTRAS  •  Runs  and  Scans  with  Applications 
BALAKRISHNAN  and  NG  •  Precedence-Type  Tests  and  Applications 
BARNETT  •  Comparative  Statistical  Inference,  Third  Edition 
BARNETT  •  Environmental  Statistics 

BARNETT  and  LEWIS  •  Outliers  in  Statistical  Data,  Third  Edition 
BARTHOLOMEW,  KNOTT,  and  MOUSTAKI  •  Latent  Variable  Models  and  Factor 
Analysis:  A  Unified  Approach,  Third  Edition 

BARTOSZYNSKI  and  NIEWIADOMSKA-BUGAJ  •  Probability  and  Statistical  Inference, 
Second  Edition 

BASILEVSKY  •  Statistical  Factor  Analysis  and  Related  Methods:  Theory  and  Applications 
BATES  and  WATTS  •  Nonlinear  Regression  Analysis  and  Its  Applications 
BECHHOFER,  SANTNER,  and  GOLDSMAN  •  Design  and  Analysis  of  Experiments  for 
Statistical  Selection,  Screening,  and  Multiple  Comparisons 


*Now  available  in  a  lower  priced  paperback  edition  in  the  Wiley  Classics  Library. 

†Now  available  in  a  lower  priced  paperback  edition  in  the  Wiley-Interscience  Paperback  Series.


BEIRLANT,  GOEGEBEUR,  SEGERS,  TEUGELS,  and  DE  WAAL  •  Statistics  of  Extremes: 
Theory  and  Applications 

BELSLEY  •  Conditioning  Diagnostics:  Collinearity  and  Weak  Data  in  Regression 

*  BELSLEY,  KUH,  and  WELSCH  •  Regression  Diagnostics:  Identifying  Influential  Data  and 

Sources  of  Collinearity 

BENDAT  and  PIERSOL  •  Random  Data:  Analysis  and  Measurement  Procedures,  Fourth 
Edition 

BERNARDO  and  SMITH  •  Bayesian  Theory 

BERZUINI,  DAWID,  and  BERNARDINELLI  •  Causality:  Statistical  Perspectives  and
Applications 

BHAT  and  MILLER  •  Elements  of  Applied  Stochastic  Processes,  Third  Edition
BHATTACHARYA  and  WAYMIRE  •  Stochastic  Processes  with  Applications
BIEMER,  GROVES,  LYBERG,  MATHIOWETZ,  and  SUDMAN  •  Measurement  Errors  in 
Surveys 

BILLINGSLEY  •  Convergence  of  Probability  Measures,  Second  Edition 

BILLINGSLEY  •  Probability  and  Measure,  Anniversary  Edition 

BIRKES  and  DODGE  •  Alternative  Methods  of  Regression 

BISGAARD  and  KULAHCI  •  Time  Series  Analysis  and  Forecasting  by  Example 

BISWAS,  DATTA,  FINE,  and  SEGAL  •  Statistical  Advances  in  the  Biomedical  Sciences: 

Clinical  Trials,  Epidemiology,  Survival  Analysis,  and  Bioinformatics 
BLISCHKE  and  MURTHY  (editors)  •  Case  Studies  in  Reliability  and  Maintenance 
BLISCHKE  and  MURTHY  •  Reliability:  Modeling,  Prediction,  and  Optimization 
BLOOMFIELD  •  Fourier  Analysis  of  Time  Series:  An  Introduction,  Second  Edition 
BOLLEN  •  Structural  Equations  with  Latent  Variables 

BOLLEN  and  CURRAN  •  Latent  Curve  Models:  A  Structural  Equation  Perspective 
BOROVKOV  •  Ergodicity  and  Stability  of  Stochastic  Processes 
BOSQ  and  BLANKE  •  Inference  and  Prediction  in  Large  Dimensions 
BOULEAU  •  Numerical  Methods  for  Stochastic  Processes 

*  BOX  and  TIAO  •  Bayesian  Inference  in  Statistical  Analysis 
BOX  •  Improving  Almost  Anything,  Revised  Edition 

*  BOX  and  DRAPER  •  Evolutionary  Operation:  A  Statistical  Method  for  Process  Improvement 
BOX  and  DRAPER  •  Response  Surfaces,  Mixtures,  and  Ridge  Analyses,  Second  Edition 
BOX,  HUNTER,  and  HUNTER  •  Statistics  for  Experimenters:  Design,  Innovation,  and
Discovery,  Second  Edition

BOX,  JENKINS,  and  REINSEL  •  Time  Series  Analysis:  Forecasting  and  Control,  Fourth
Edition 

BOX,  LUCENO,  and  PANIAGUA-QUINONES  •  Statistical  Control  by  Monitoring  and 
Adjustment,  Second  Edition 

*  BROWN  and  HOLLANDER  •  Statistics:  A  Biomedical  Introduction 
CAIROLI  and  DALANG  •  Sequential  Stochastic  Optimization 

CASTILLO,  HADI,  BALAKRISHNAN,  and  SARABIA  •  Extreme  Value  and  Related 
Models  with  Applications  in  Engineering  and  Science 
CHAN  •  Time  Series:  Applications  to  Finance  with  R  and  S-Plus®,  Second  Edition 
CHARALAMBIDES  •  Combinatorial  Methods  in  Discrete  Distributions 
CHATTERJEE  and  HADI  •  Regression  Analysis  by  Example,  Fifth  Edition 
CHATTERJEE  and  HADI  •  Sensitivity  Analysis  in  Linear  Regression 
CHERNICK  •  Bootstrap  Methods:  A  Guide  for  Practitioners  and  Researchers,  Second 
Edition 

CHERNICK  and  FRIIS  •  Introductory  Biostatistics  for  the  Health  Sciences 
CHILES  and  DELFINER  •  Geostatistics:  Modeling  Spatial  Uncertainty,  Second  Edition 
CHOW  and  LIU  •  Design  and  Analysis  of  Clinical  Trials:  Concepts  and  Methodologies, 
Second  Edition 




CLARKE  •  Linear  Models:  The  Theory  and  Application  of  Analysis  of  Variance 
CLARKE  and  DISNEY  •  Probability  and  Random  Processes:  A  First  Course  with 
Applications,  Second  Edition 

*  COCHRAN  and  COX  •  Experimental  Designs,  Second  Edition 

COLLINS  and  LANZA  •  Latent  Class  and  Latent  Transition  Analysis:  With  Applications  in 
the  Social,  Behavioral,  and  Health  Sciences 
CONGDON  •  Applied  Bayesian  Modelling 
CONGDON  •  Bayesian  Models  for  Categorical  Data 
CONGDON  •  Bayesian  Statistical  Modelling,  Second  Edition 
CONOVER  •  Practical  Nonparametric  Statistics,  Third  Edition 
COOK  •  Regression  Graphics 

COOK  and  WEISBERG  •  An  Introduction  to  Regression  Graphics 

COOK  and  WEISBERG  •  Applied  Regression  Including  Computing  and  Graphics 

CORNELL  •  A  Primer  on  Experiments  with  Mixtures 

CORNELL  •  Experiments  with  Mixtures,  Designs,  Models,  and  the  Analysis  of  Mixture 
Data,  Third  Edition 

COX  •  A  Handbook  of  Introductory  Statistical  Methods

CRESSIE  •  Statistics  for  Spatial  Data,  Revised  Edition 

CRESSIE  and  WIKLE  •  Statistics  for  Spatio-Temporal  Data 

CSORGO  and  HORVATH  •  Limit  Theorems  in  Change  Point  Analysis 

DAGPUNAR  •  Simulation  and  Monte  Carlo:  With  Applications  in  Finance  and  MCMC 

DANIEL  •  Applications  of  Statistics  to  Industrial  Experimentation 

DANIEL  •  Biostatistics:  A  Foundation  for  Analysis  in  the  Health  Sciences,  Eighth  Edition 

*  DANIEL  •  Fitting  Equations  to  Data:  Computer  Analysis  of  Multifactor  Data,  Second  Edition 
DASU  and  JOHNSON  •  Exploratory  Data  Mining  and  Data  Cleaning 

DAVID  and  NAGARAJA  •  Order  Statistics,  Third  Edition 

*  DEGROOT,  FIENBERG,  and  KADANE  •  Statistics  and  the  Law 
DEL  CASTILLO  •  Statistical  Process  Adjustment  for  Quality  Control 

DeMARIS  •  Regression  with  Social  Data:  Modeling  Continuous  and  Limited  Response
Variables 

DEMIDENKO  •  Mixed  Models:  Theory  and  Applications 
DENISON,  HOLMES,  MALLICK  and  SMITH  •  Bayesian  Methods  for  Nonlinear 
Classification  and  Regression 

DETTE  and  STUDDEN  •  The  Theory  of  Canonical  Moments  with  Applications  in  Statistics, 
Probability,  and  Analysis 

DEY  and  MUKERJEE  •  Fractional  Factorial  Plans 

DILLON  and  GOLDSTEIN  •  Multivariate  Analysis:  Methods  and  Applications 

*  DODGE  and  ROMIG  •  Sampling  Inspection  Tables,  Second  Edition 

*  DOOB  •  Stochastic  Processes 

DOWDY,  WEARDEN,  and  CHILKO  •  Statistics  for  Research,  Third  Edition
DRAPER  and  SMITH  •  Applied  Regression  Analysis,  Third  Edition 
DRYDEN  and  MARDIA  •  Statistical  Shape  Analysis 
DUDEWICZ  and  MISHRA  •  Modern  Mathematical  Statistics

DUNN  and  CLARK  •  Basic  Statistics:  A  Primer  for  the  Biomedical  Sciences,  Fourth  Edition 
DUPUIS  and  ELLIS  •  A  Weak  Convergence  Approach  to  the  Theory  of  Large  Deviations 
EDLER  and  KITSOS  •  Recent  Advances  in  Quantitative  Methods  in  Cancer  and  Human 
Health  Risk  Assessment 

*  ELANDT-JOHNSON  and  JOHNSON  •  Survival  Models  and  Data  Analysis 
ENDERS  •  Applied  Econometric  Time  Series,  Third  Edition 

†  ETHIER  and  KURTZ  •  Markov  Processes:  Characterization  and  Convergence
EVANS,  HASTINGS,  and  PEACOCK  •  Statistical  Distributions,  Third  Edition 




EVERITT,  LANDAU,  LEESE,  and  STAHL  •  Cluster  Analysis,  Fifth  Edition 
FEDERER  and  KING  •  Variations  on  Split  Plot  and  Split  Block  Experiment  Designs 
FELLER  •  An  Introduction  to  Probability  Theory  and  Its  Applications,  Volume  I,  Third 
Edition,  Revised;  Volume  II,  Second  Edition
FITZMAURICE,  LAIRD,  and  WARE  •  Applied  Longitudinal  Analysis,  Second  Edition 

*  FLEISS  •  The  Design  and  Analysis  of  Clinical  Experiments 
FLEISS  •  Statistical  Methods  for  Rates  and  Proportions,  Third  Edition 

*  FLEMING  and  HARRINGTON  •  Counting  Processes  and  Survival  Analysis 
FUJIKOSHI,  ULYANOV,  and  SHIMIZU  •  Multivariate  Statistics:  High-Dimensional  and
Large-Sample  Approximations

FULLER  •  Introduction  to  Statistical  Time  Series,  Second  Edition 
†  FULLER  •  Measurement  Error  Models
GALLANT  •  Nonlinear  Statistical  Models 
GEISSER  •  Modes  of  Parametric  Statistical  Inference 

GELMAN  and  MENG  •  Applied  Bayesian  Modeling  and  Causal  Inference  from 
Incomplete-Data  Perspectives 

GEWEKE  •  Contemporary  Bayesian  Econometrics  and  Statistics 
GHOSH,  MUKHOPADHYAY,  and  SEN  •  Sequential  Estimation 
GIESBRECHT  and  GUMPERTZ  •  Planning,  Construction,  and  Statistical  Analysis  of 
Comparative  Experiments 
GIFI  •  Nonlinear  Multivariate  Analysis 
GIVENS  and  HOETING  •  Computational  Statistics 
GLASSERMAN  and  YAO  •  Monotone  Structure  in  Discrete-Event  Systems 
GNANADESIKAN  •  Methods  for  Statistical  Data  Analysis  of  Multivariate  Observations, 
Second  Edition 

GOLDSTEIN  •  Multilevel  Statistical  Models,  Fourth  Edition 

GOLDSTEIN  and  LEWIS  •  Assessment:  Problems,  Development,  and  Statistical  Issues 
GOLDSTEIN  and  WOOFF  •  Bayes  Linear  Statistics 
GREENWOOD  and  NIKULIN  •  A  Guide  to  Chi-Squared  Testing 
GROSS,  SHORTLE,  THOMPSON,  and  HARRIS  •  Fundamentals  of  Queueing  Theory, 
Fourth  Edition 

GROSS,  SHORTLE,  THOMPSON,  and  HARRIS  •  Solutions  Manual  to  Accompany 
Fundamentals  of  Queueing  Theory,  Fourth  Edition 

*  HAHN  and  SHAPIRO  •  Statistical  Models  in  Engineering 

HAHN  and  MEEKER  •  Statistical  Intervals:  A  Guide  for  Practitioners 
HALD  •  A  History  of  Probability  and  Statistics  and  their  Applications  Before  1750
†  HAMPEL  •  Robust  Statistics:  The  Approach  Based  on  Influence  Functions
HARTUNG,  KNAPP,  and  SINHA  •  Statistical  Meta-Analysis  with  Applications 
HEIBERGER  •  Computation  for  the  Analysis  of  Designed  Experiments 
HEDAYAT  and  SINHA  •  Design  and  Inference  in  Finite  Population  Sampling 
HEDEKER  and  GIBBONS  •  Longitudinal  Data  Analysis 
HELLER  •  MACSYMA  for  Statisticians 

HERITIER,  CANTONI,  COPT,  and  VICTORIA-FESER  •  Robust  Methods  in  Biostatistics 
HINKELMANN  and  KEMPTHORNE  •  Design  and  Analysis  of  Experiments,  Volume  1:
Introduction  to  Experimental  Design,  Second  Edition
HINKELMANN  and  KEMPTHORNE  •  Design  and  Analysis  of  Experiments,  Volume  2: 
Advanced  Experimental  Design 

HINKELMANN  (editor)  •  Design  and  Analysis  of  Experiments,  Volume  3:  Special  Design 
and  Applications 

HOAGLIN,  MOSTELLER,  and  TUKEY  •  Fundamentals  of  Exploratory  Analysis  of 
Variance 

*  HOAGLIN,  MOSTELLER,  and  TUKEY  •  Exploring  Data  Tables,  Trends  and  Shapes 




*  HOAGLIN,  MOSTELLER,  and  TUKEY  •  Understanding  Robust  and  Exploratory  Data 
Analysis 

HOCHBERG  and  TAMHANE  •  Multiple  Comparison  Procedures 

HOCKING  •  Methods  and  Applications  of  Linear  Models:  Regression  and  the  Analysis  of 
Variance,  Second  Edition 

HOEL  •  Introduction  to  Mathematical  Statistics,  Fifth  Edition 
HOGG  and  KLUGMAN  •  Loss  Distributions 

HOLLANDER  and  WOLFE  •  Nonparametric  Statistical  Methods,  Second  Edition 
HOSMER  and  LEMESHOW  •  Applied  Logistic  Regression,  Second  Edition 
HOSMER,  LEMESHOW,  and  MAY  •  Applied  Survival  Analysis:  Regression  Modeling  of 
Time-to-Event  Data,  Second  Edition 
HUBER  •  Data  Analysis:  What  Can  Be  Learned  From  the  Past  50  Years 
HUBER  •  Robust  Statistics 

†  HUBER  and  RONCHETTI  •  Robust  Statistics,  Second  Edition
HUBERTY  •  Applied  Discriminant  Analysis,  Second  Edition 

HUBERTY  and  OLEJNIK  •  Applied  MANOVA  and  Discriminant  Analysis,  Second  Edition 
HUITEMA  •  The  Analysis  of  Covariance  and  Alternatives:  Statistical  Methods  for 
Experiments,  Quasi-Experiments,  and  Single-Case  Studies,  Second  Edition 
HUNT  and  KENNEDY  •  Financial  Derivatives  in  Theory  and  Practice,  Revised  Edition 
HURD  and  MIAMEE  •  Periodically  Correlated  Random  Sequences:  Spectral  Theory  and 
Practice 

HUSKOVA,  BERAN,  and  DUPAC  •  Collected  Works  of  Jaroslav  Hajek — with  Commentary 
HUZURBAZAR  •  Flowgraph  Models  for  Multistate  Time-to-Event  Data 
JACKMAN  •  Bayesian  Analysis  for  the  Social  Sciences 
†  JACKSON  •  A  User’s  Guide  to  Principal  Components
JOHN  •  Statistical  Methods  in  Engineering  and  Quality  Assurance 
JOHNSON  •  Multivariate  Statistical  Simulation 

JOHNSON  and  BALAKRISHNAN  •  Advances  in  the  Theory  and  Practice  of  Statistics:  A 
Volume  in  Honor  of  Samuel  Kotz 

JOHNSON,  KEMP,  and  KOTZ  •  Univariate  Discrete  Distributions,  Third  Edition 
JOHNSON  and  KOTZ  (editors)  •  Leading  Personalities  in  Statistical  Sciences:  From  the 
Seventeenth  Century  to  the  Present 

JOHNSON,  KOTZ,  and  BALAKRISHNAN  •  Continuous  Univariate  Distributions,  Volume
1,  Second  Edition

JOHNSON,  KOTZ,  and  BALAKRISHNAN  •  Continuous  Univariate  Distributions,  Volume
2,  Second  Edition

JOHNSON,  KOTZ,  and  BALAKRISHNAN  •  Discrete  Multivariate  Distributions 
JUDGE,  GRIFFITHS,  HILL,  LUTKEPOHL,  and  LEE  •  The  Theory  and  Practice  of 
Econometrics,  Second  Edition 

JUREK  and  MASON  •  Operator-Limit  Distributions  in  Probability  Theory 
KADANE  •  Bayesian  Methods  and  Ethics  in  a  Clinical  Trial  Design 
KADANE  and  SCHUM  •  A  Probabilistic  Analysis  of  the  Sacco  and  Vanzetti  Evidence
KALBFLEISCH  and  PRENTICE  •  The  Statistical  Analysis  of  Failure  Time  Data,  Second 
Edition 

KARIYA  and  KURATA  •  Generalized  Least  Squares 
KASS  and  VOS  •  Geometrical  Foundations  of  Asymptotic  Inference 
†  KAUFMAN  and  ROUSSEEUW  •  Finding  Groups  in  Data:  An  Introduction  to  Cluster
Analysis 

KEDEM  and  FOKIANOS  •  Regression  Models  for  Time  Series  Analysis 
KENDALL,  BARDEN,  CARNE,  and  LE  •  Shape  and  Shape  Theory 
KHURI  •  Advanced  Calculus  with  Applications  in  Statistics,  Second  Edition 
KHURI,  MATHEW,  and  SINHA  •  Statistical  Tests  for  Mixed  Linear  Models 




*  KISH  •  Statistical  Design  for  Research 

KLEIBER  and  KOTZ  •  Statistical  Size  Distributions  in  Economics  and  Actuarial  Sciences 
KLEMELA  •  Smoothing  of  Multivariate  Data:  Density  Estimation  and  Visualization 
KLUGMAN,  PANJER,  and  WILLMOT  •  Loss  Models:  From  Data  to  Decisions,  Fourth
Edition 

KLUGMAN,  PANJER,  and  WILLMOT  •  Student  Solutions  Manual  to  Accompany  Loss 
Models:  From  Data  to  Decisions,  Fourth  Edition 
KOSKI  and  NOBLE  •  Bayesian  Networks:  An  Introduction 

KOTZ,  BALAKRISHNAN,  and  JOHNSON  •  Continuous  Multivariate  Distributions,  Volume 
1,  Second  Edition

KOTZ  and  JOHNSON  (editors)  •  Encyclopedia  of  Statistical  Sciences:  Volumes  1  to  9  with 
Index 

KOTZ  and  JOHNSON  (editors)  •  Encyclopedia  of  Statistical  Sciences:  Supplement  Volume 
KOTZ,  READ,  and  BANKS  (editors)  •  Encyclopedia  of  Statistical  Sciences:  Update  Volume
1

KOTZ,  READ,  and  BANKS  (editors)  •  Encyclopedia  of  Statistical  Sciences:  Update  Volume 
2 

KOWALSKI  and  TU  •  Modern  Applied  U-Statistics 

KRISHNAMOORTHY  and  MATHEW  •  Statistical  Tolerance  Regions:  Theory,
Applications,  and  Computation

KROESE,  TAIMRE,  and  BOTEV  •  Handbook  of  Monte  Carlo  Methods 
KROONENBERG  •  Applied  Multiway  Data  Analysis 

KULINSKAYA,  MORGENTHALER,  and  STAUDTE  •  Meta  Analysis:  A  Guide  to 
Calibrating  and  Combining  Statistical  Evidence 
KULKARNI  and  HARMAN  •  An  Elementary  Introduction  to  Statistical  Learning  Theory 
KUROWICKA  and  COOKE  •  Uncertainty  Analysis  with  High  Dimensional  Dependence 
Modelling 

KVAM  and  VIDAKOVIC  •  Nonparametric  Statistics  with  Applications  to  Science  and 
Engineering 

LACHIN  •  Biostatistical  Methods:  The  Assessment  of  Relative  Risks,  Second  Edition 
LAD  •  Operational  Subjective  Statistical  Methods:  A  Mathematical,  Philosophical,  and 
Historical  Introduction 

LAMPERTI  •  Probability:  A  Survey  of  the  Mathematical  Theory,  Second  Edition 
LAWLESS  •  Statistical  Models  and  Methods  for  Lifetime  Data,  Second  Edition 
LAWSON  •  Statistical  Methods  in  Spatial  Epidemiology,  Second  Edition 
LE  •  Applied  Categorical  Data  Analysis,  Second  Edition 
LE  •  Applied  Survival  Analysis 

LEE  •  Structural  Equation  Modeling:  A  Bayesian  Approach 

LEE  and  WANG  •  Statistical  Methods  for  Survival  Data  Analysis,  Third  Edition 

LePAGE  and  BILLARD  •  Exploring  the  Limits  of  Bootstrap 

LESSLER  and  KALSBEEK  •  Nonsampling  Errors  in  Surveys 

LEYLAND  and  GOLDSTEIN  (editors)  •  Multilevel  Modelling  of  Health  Statistics 

LIAO  •  Statistical  Group  Comparison 

LIN  •  Introductory  Stochastic  Analysis  for  Finance  and  Insurance 
LITTLE  and  RUBIN  •  Statistical  Analysis  with  Missing  Data,  Second  Edition 
LLOYD  •  The  Statistical  Analysis  of  Categorical  Data 
LOWEN  and  TEICH  •  Fractal-Based  Point  Processes 

MAGNUS  and  NEUDECKER  •  Matrix  Differential  Calculus  with  Applications  in  Statistics 
and  Econometrics,  Revised  Edition 
MALLER  and  ZHOU  •  Survival  Analysis  with  Long  Term  Survivors 
MARCHETTE  •  Random  Graphs  for  Statistical  Pattern  Recognition 
MARDIA  and  JUPP  •  Directional  Statistics




MARKOVICH  •  Nonparametric  Analysis  of  Univariate  Heavy-Tailed  Data:  Research  and 
Practice 

MARONNA,  MARTIN  and  YOHAI  •  Robust  Statistics:  Theory  and  Methods 
MASON,  GUNST,  and  HESS  •  Statistical  Design  and  Analysis  of  Experiments  with 
Applications  to  Engineering  and  Science,  Second  Edition 
McCOOL  •  Using  the  Weibull  Distribution:  Reliability,  Modeling,  and  Inference 
McCULLOCH,  SEARLE,  and  NEUHAUS  •  Generalized,  Linear,  and  Mixed  Models,
Second  Edition

McFADDEN  •  Management  of  Data  in  Clinical  Trials,  Second  Edition 

*  McLACHLAN  •  Discriminant  Analysis  and  Statistical  Pattern  Recognition 
McLACHLAN,  DO,  and  AMBROISE  •  Analyzing  Microarray  Gene  Expression  Data 
McLACHLAN  and  KRISHNAN  •  The  EM  Algorithm  and  Extensions,  Second  Edition 
McLACHLAN  and  PEEL  •  Finite  Mixture  Models 

McNEIL  •  Epidemiological  Research  Methods 

MEEKER  and  ESCOBAR  •  Statistical  Methods  for  Reliability  Data 

MEERSCHAERT  and  SCHEFFLER  •  Limit  Distributions  for  Sums  of  Independent  Random 
Vectors:  Heavy  Tails  in  Theory  and  Practice 

MENGERSEN,  ROBERT,  and  TITTERINGTON  •  Mixtures:  Estimation  and  Applications 
MICKEY,  DUNN,  and  CLARK  •  Applied  Statistics:  Analysis  of  Variance  and  Regression, 
Third  Edition 

*  MILLER  •  Survival  Analysis,  Second  Edition 

MONTGOMERY,  JENNINGS,  and  KULAHCI  •  Introduction  to  Time  Series  Analysis  and 
Forecasting 

MONTGOMERY,  PECK,  and  VINING  •  Introduction  to  Linear  Regression  Analysis,  Fifth 
Edition 

MORGENTHALER  and  TUKEY  •  Configural  Polysampling:  A  Route  to  Practical 
Robustness 

MUIRHEAD  •  Aspects  of  Multivariate  Statistical  Theory 

MULLER  and  STOYAN  •  Comparison  Methods  for  Stochastic  Models  and  Risks 

MURTHY,  XIE,  and  JIANG  •  Weibull  Models 

MYERS,  MONTGOMERY,  and  ANDERSON-COOK  •  Response  Surface  Methodology:
Process  and  Product  Optimization  Using  Designed  Experiments,  Third  Edition
MYERS,  MONTGOMERY,  VINING,  and  ROBINSON  •  Generalized  Linear  Models:  With
Applications  in  Engineering  and  the  Sciences,  Second  Edition 
NATVIG  •  Multistate  Systems  Reliability  Theory  With  Applications 
†  NELSON  •  Accelerated  Testing,  Statistical  Models,  Test  Plans,  and  Data  Analyses
†  NELSON  •  Applied  Life  Data  Analysis
NEWMAN  •  Biostatistical  Methods  in  Epidemiology 
NG,  TIAN,  and  TANG  •  Dirichlet  Theory:  Theory,  Methods  and  Applications
OKABE,  BOOTS,  SUGIHARA,  and  CHIU  •  Spatial  Tessellations:  Concepts  and
Applications  of  Voronoi  Diagrams,  Second  Edition 
OLIVER  and  SMITH  •  Influence  Diagrams,  Belief  Nets  and  Decision  Analysis 
PALTA  •  Quantitative  Methods  in  Population  Health:  Extensions  of  Ordinary  Regressions 
PANJER  •  Operational  Risk:  Modeling  and  Analytics 
PANKRATZ  •  Forecasting  with  Dynamic  Regression  Models 
PANKRATZ  •  Forecasting  with  Univariate  Box-Jenkins  Models:  Concepts  and  Cases 
PARDOUX  •  Markov  Processes  and  Applications:  Algorithms,  Networks,  Genome  and 
Finance 

PARMIGIANI  and  INOUE  •  Decision  Theory:  Principles  and  Approaches 

*  PARZEN  •  Modern  Probability  Theory  and  Its  Applications 
PENA,  TIAO,  and  TSAY  •  A  Course  in  Time  Series  Analysis 

PESARIN  and  SALMASO  •  Permutation  Tests  for  Complex  Data:  Applications  and  Software 




PIANTADOSI  •  Clinical  Trials:  A  Methodologic  Perspective,  Second  Edition 
POURAHMADI  •  Foundations  of  Time  Series  Analysis  and  Prediction  Theory
POWELL  •  Approximate  Dynamic  Programming:  Solving  the  Curses  of  Dimensionality, 
Second  Edition 

POWELL  and  RYZHOV  •  Optimal  Learning 

PRESS  •  Subjective  and  Objective  Bayesian  Statistics,  Second  Edition 
PRESS  and  TANUR  •  The  Subjectivity  of  Scientists  and  the  Bayesian  Approach
PURI,  VILAPLANA,  and  WERTZ  •  New  Perspectives  in  Theoretical  and  Applied  Statistics
†  PUTERMAN  •  Markov  Decision  Processes:  Discrete  Stochastic  Dynamic  Programming
QIU  •  Image  Processing  and  Jump  Regression  Analysis

*  RAO  •  Linear  Statistical  Inference  and  Its  Applications,  Second  Edition 
RAO  •  Statistical  Inference  for  Fractional  Diffusion  Processes 

RAUSAND  and  HØYLAND  •  System  Reliability  Theory:  Models,  Statistical  Methods,  and
Applications,  Second  Edition 

RAYNER,  THAS,  and  BEST  •  Smooth  Tests  of  Goodness  of  Fit:  Using  R,  Second  Edition
RENCHER  and  SCHAALJE  •  Linear  Models  in  Statistics,  Second  Edition 
RENCHER  and  CHRISTENSEN  •  Methods  of  Multivariate  Analysis,  Third  Edition 
RENCHER  •  Multivariate  Statistical  Inference  with  Applications 
RIGDON  and  BASU  •  Statistical  Methods  for  the  Reliability  of  Repairable  Systems 

*  RIPLEY  •  Spatial  Statistics 

*  RIPLEY  •  Stochastic  Simulation 

ROHATGI  and  SALEH  •  An  Introduction  to  Probability  and  Statistics,  Second  Edition 
ROLSKI,  SCHMIDLI,  SCHMIDT,  and  TEUGELS  •  Stochastic  Processes  for  Insurance  and 
Finance 

ROSENBERGER  and  LACHIN  •  Randomization  in  Clinical  Trials:  Theory  and  Practice 
ROSSI,  ALLENBY,  and  MCCULLOCH  •  Bayesian  Statistics  and  Marketing 
†  ROUSSEEUW  and  LEROY  •  Robust  Regression  and  Outlier  Detection
ROYSTON  and  SAUERBREI  •  Multivariate  Model  Building:  A  Pragmatic  Approach  to 
Regression  Analysis  Based  on  Fractional  Polynomials  for  Modeling  Continuous  Variables 

*  RUBIN  •  Multiple  Imputation  for  Nonresponse  in  Surveys

RUBINSTEIN  and  KROESE  •  Simulation  and  the  Monte  Carlo  Method,  Second  Edition 

RUBINSTEIN  and  MELAMED  •  Modern  Simulation  and  Modeling

RYAN  •  Modern  Engineering  Statistics

RYAN  •  Modern  Experimental  Design

RYAN  •  Modern  Regression  Methods,  Second  Edition

RYAN  •  Statistical  Methods  for  Quality  Improvement,  Third  Edition 

SALEH  •  Theory  of  Preliminary  Test  and  Stein-Type  Estimation  with  Applications 

SALTELLI,  CHAN,  and  SCOTT  (editors)  •  Sensitivity  Analysis 

SCHERER  •  Batch  Effects  and  Noise  in  Microarray  Experiments:  Sources  and  Solutions

*  SCHEFFE  •  The  Analysis  of  Variance 

SCHIMEK  •  Smoothing  and  Regression:  Approaches,  Computation,  and  Application
SCHOTT  •  Matrix  Analysis  for  Statistics,  Second  Edition 
SCHOUTENS  •  Levy  Processes  in  Finance:  Pricing  Financial  Derivatives 
SCOTT  •  Multivariate  Density  Estimation:  Theory,  Practice,  and  Visualization 

*  SEARLE  •  Linear  Models 

*  SEARLE  •  Linear  Models  for  Unbalanced  Data 
†  SEARLE  •  Matrix  Algebra  Useful  for  Statistics

†  SEARLE,  CASELLA,  and  McCULLOCH  •  Variance  Components
SEARLE  and  WILLETT  •  Matrix  Algebra  for  Applied  Economics
SEBER  •  A  Matrix  Handbook  For  Statisticians
†  SEBER  •  Multivariate  Observations
SEBER  and  LEE  •  Linear  Regression  Analysis,  Second  Edition




†  SEBER  and  WILD  •  Nonlinear  Regression
SENNOTT  •  Stochastic  Dynamic  Programming  and  the  Control  of  Queueing  Systems 
*  SERFLING  •  Approximation  Theorems  of  Mathematical  Statistics 
SHAFER  and  VOVK  •  Probability  and  Finance:  It’s  Only  a  Game! 

SHERMAN  •  Spatial  Statistics  and  Spatio-Temporal  Data:  Covariance  Functions  and 
Directional  Properties 

SILVAPULLE  and  SEN  •  Constrained  Statistical  Inference:  Inequality,  Order,  and  Shape 
Restrictions 

SINGPURWALLA  •  Reliability  and  Risk:  A  Bayesian  Perspective 

SMALL  and  MCLEISH  •  Hilbert  Space  Methods  in  Probability  and  Statistical  Inference 

SRIVASTAVA  •  Methods  of  Multivariate  Statistics 

STAPLETON  •  Linear  Statistical  Models,  Second  Edition 

STAPLETON  •  Models  for  Probability  and  Statistical  Inference:  Theory  and  Applications 
STAUDTE  and  SHEATHER  •  Robust  Estimation  and  Testing 
STOYAN  •  Counterexamples  in  Probability,  Second  Edition 

STOYAN,  KENDALL,  and  MECKE  •  Stochastic  Geometry  and  Its  Applications,  Second 
Edition 

STOYAN  and  STOYAN  •  Fractals,  Random  Shapes  and  Point  Fields:  Methods  of 
Geometrical  Statistics 

STREET  and  BURGESS  •  The  Construction  of  Optimal  Stated  Choice  Experiments:  Theory 
and  Methods 

STYAN  •  The  Collected  Papers  of  T.  W.  Anderson:  1943-1985

SUTTON,  ABRAMS,  JONES,  SHELDON,  and  SONG  •  Methods  for  Meta-Analysis  in 
Medical  Research 

TAKEZAWA  •  Introduction  to  Nonparametric  Regression 

TAMHANE  •  Statistical  Analysis  of  Designed  Experiments:  Theory  and  Applications 
TANAKA  •  Time  Series  Analysis:  Nonstationary  and  Noninvertible  Distribution  Theory 
THOMPSON  •  Empirical  Model  Building:  Data,  Models,  and  Reality,  Second  Edition
THOMPSON  •  Sampling,  Third  Edition 
THOMPSON  •  Simulation:  A  Modeler's  Approach 
THOMPSON  and  SEBER  •  Adaptive  Sampling 

THOMPSON,  WILLIAMS,  and  FINDLAY  •  Models  for  Investors  in  Real  World  Markets 
TIERNEY  •  LISP-STAT:  An  Object-Oriented  Environment  for  Statistical  Computing  and 
Dynamic  Graphics 

TSAY  •  Analysis  of  Financial  Time  Series,  Third  Edition 
TSAY  •  An  Introduction  to  Analysis  of  Financial  Data  with  R 

UPTON  and  FINGLETON  •  Spatial  Data  Analysis  by  Example,  Volume  II:  Categorical  and 
Directional  Data 

†  VAN  BELLE  •  Statistical  Rules  of  Thumb,  Second  Edition
VAN  BELLE,  FISHER,  HEAGERTY,  and  LUMLEY  •  Biostatistics:  A  Methodology  for  the 
Health  Sciences,  Second  Edition 
VESTRUP  •  The  Theory  of  Measures  and  Integration 
VIDAKOVIC  •  Statistical  Modeling  by  Wavelets 
VIERTL  •  Statistical  Methods  for  Fuzzy  Data 

VINOD  and  REAGLE  •  Preparing  for  the  Worst:  Incorporating  Downside  Risk  in  Stock 
Market  Investments 

WALLER  and  GOTWAY  •  Applied  Spatial  Statistics  for  Public  Health  Data 
WEISBERG  •  Applied  Linear  Regression,  Third  Edition 
WEISBERG  •  Bias  and  Causation:  Models  and  Judgment  for  Valid  Comparisons 
WELSH  •  Aspects  of  Statistical  Inference 

WESTFALL  and  YOUNG  •  Resampling-Based  Multiple  Testing:  Examples  and  Methods  for 
p-Value  Adjustment




*  WHITTAKER  •  Graphical  Models  in  Applied  Multivariate  Statistics 

WINKER  •  Optimization  Heuristics  in  Economics:  Applications  of  Threshold  Accepting 
WOODWORTH  •  Biostatistics:  A  Bayesian  Introduction 

WOOLSON  and  CLARKE  •  Statistical  Methods  for  the  Analysis  of  Biomedical  Data, 
Second  Edition 

WU  and  HAMADA  •  Experiments:  Planning,  Analysis,  and  Parameter  Design  Optimization, 
Second  Edition 

WU  and  ZHANG  •  Nonparametric  Regression  Methods  for  Longitudinal  Data  Analysis 
YIN  •  Clinical  Trial  Design:  Bayesian  and  Frequentist  Adaptive  Methods 
YOUNG,  VALERO-MORA,  and  FRIENDLY  •  Visual  Statistics:  Seeing  Data  with  Dynamic 
Interactive  Graphics 
ZACKS  •  Stage-Wise  Adaptive  Designs 

*  ZELLNER  •  An  Introduction  to  Bayesian  Inference  in  Econometrics 
ZELTERMAN  •  Discrete  Distributions — Applications  in  the  Health  Sciences 
ZHOU,  OBUCHOWSKI,  and  MCCLISH  •  Statistical  Methods  in  Diagnostic  Medicine,
Second  Edition




